CS4132 Data Analytics
Wordle was created in mid-2021 (more specifically, the game dates its first puzzle to 20 June 2021) by software engineer Josh Wardle, who made the game for his partner because he knew she liked word games. The couple played for months, and after the game became a hit in family group chats, Wardle released it to the public in October 2021. It has since become a worldwide sensation, played daily by millions.
The free-to-play online word-guessing game has players guess a five-letter word in six tries. There is a single daily solution, and all players attempt to guess the same word. The New York Times Company purchased the game from creator Josh Wardle in January 2022.
As someone who plays Wordle on a daily basis, I was intrigued to explore Wordle in a more analytical fashion. In this project, I will look into Wordle's growth (and decline) and popularity, as well as analyze players' posted results.
pip install pyarrow
pip install plotly
pip install statsmodels
import numpy as np
import pandas as pd
import pyarrow
from pyarrow import csv as arcsv, Table
# pyarrow handles the literal newlines inside tweet text (the emoji grids)
# that break the default pandas CSV parser
def read_csv(filename, dtype=None):
    return arcsv.read_csv(
        filename,
        parse_options=arcsv.ParseOptions(newlines_in_values=True),
        convert_options=arcsv.ConvertOptions(column_types=dtype)
    ).to_pandas()

def write_csv(data, filename):
    arcsv.write_csv(Table.from_pandas(data), filename)
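Tweets contain literal newlines (the emoji grid), so the stored CSV files have multi-line fields. Standard CSV quoting makes this legal, which is why the reader above needs `newlines_in_values=True`. A stdlib-only sketch of the round trip (the sample text is illustrative):

```python
import csv
import io

# write a single-field row whose value contains embedded newlines
buf = io.StringIO()
csv.writer(buf).writerow(['Wordle 420 4/6\n⬛⬛⬛⬛⬛\n🟩🟩🟩🟩🟩'])

# read it back: the quoted newlines survive as part of the field
buf.seek(0)
row = next(csv.reader(buf))
print(row[0].count('\n'))  # 2 — both newlines are preserved
```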
Refer to appendix section A.1
read_csv('data/twitter/raw/wordles420.csv').head(20)
| | id | date | content | username | userLocation | sourceLabel |
|---|---|---|---|---|---|---|
| 0 | 1559233311226138628 | 2022-08-15 17:39:25+00:00 | Wordle 420 6/6\n\n⬛🟩⬛⬛⬛\n⬛🟩⬛⬛⬛\n⬛🟩🟩🟩🟩\n⬛🟩🟩🟩🟩\n... | IllanMejido | Tokyo-3 | Twitter Web App |
| 1 | 1559232918035525632 | 2022-08-15 17:37:51+00:00 | Wordle 420 3/6*\n\nLine 1: 3rd perfect.\nLine ... | klchubbuck | Seattle, WA | Twitter Web App |
| 2 | 1559215470867189765 | 2022-08-15 16:28:31+00:00 | Wordle 422 6/6\n\n🟨⬜⬜⬜⬜\n🟩🟩⬜⬜⬜\n🟩🟩⬜🟩🟩\n🟩🟩⬜🟩🟩\n... | nottodaybrother | Grinnell, IA | Twitter Web App |
| 3 | 1559203001620787200 | 2022-08-15 15:38:58+00:00 | Wordle 420 5/6\n\n⬜⬜⬜⬜🟨\n⬜⬜🟩⬜⬜\n⬜⬜🟩🟩🟩\n⬜🟩🟩🟩🟩\n... | jocal3 | Sammamish | Twitter Web App |
| 4 | 1559190764910952450 | 2022-08-15 14:50:21+00:00 | Wordle 420 5/6\n\n⬜⬜⬜🟨⬜\n🟨⬜⬜🟨⬜\n⬜⬜🟩⬜⬜\n⬜🟩🟩⬜🟨\n... | ThePublicHorse | Toronto, Ontario | TweetDeck |
| 5 | 1559175718520688643 | 2022-08-15 13:50:34+00:00 | Wordle 420 4/6\n\n⬛⬛⬛🟨⬛\n⬛⬛🟩⬛🟨\n🟩⬛🟩⬛⬛\n🟩🟩🟩🟩🟩 | Pyrotastic | Miami, FL | Twitter Web App |
| 6 | 1559149586358362113 | 2022-08-15 12:06:43+00:00 | @afterexposure Yeah, your Wordle 420 X/6 was s... | HoosAGoodBoy | Montréal, Québec | Twitter Web App |
| 7 | 1559149366446792704 | 2022-08-15 12:05:51+00:00 | Saturday, 8/13’s\n\nWordle 420 5/6\n\n⬛⬛⬛⬛🟩\n⬛... | ZWordles | Twitter for iPhone | |
| 8 | 1559146814535770113 | 2022-08-15 11:55:42+00:00 | Wordle 420 4/6\n\n⬛⬛⬛⬛⬛\n⬛⬛⬛⬛⬛\n⬛🟩⬛⬛⬛\n🟩🟩🟩🟩🟩 | DaithiKelleher | Dublin, Ireland | Twitter for Android |
| 9 | 1559140700649918464 | 2022-08-15 11:31:25+00:00 | Wordle 420 5/6\n\n⬛🟩⬛⬛⬛\n⬛🟩⬛⬛⬛\n⬛🟩🟩⬛🟩\n⬛🟩🟩🟩🟩\n... | wbourgou | The cold part of Canada | Twitter for Android |
| 10 | 1559115522356441088 | 2022-08-15 09:51:22+00:00 | Wordle 420 5/6\n\n⬛⬛⬛⬛🟨\n⬛⬛⬛🟨⬛ \n⬛🟨🟨⬛🟨\n⬛⬛🟨🟨🟨\... | sykedelic_3406 | India | Twitter Web App |
| 11 | 1559079102048452608 | 2022-08-15 07:26:38+00:00 | Wordle 420 5/6*\n\n⬛⬛🟨⬛⬛\n⬛🟨⬛🟩⬛\n🟩🟩⬛🟩🟩\n🟩🟩⬛🟩🟩\... | manishgant | San Francisco Bay Area | Twitter for iPhone |
| 12 | 1559063471551037441 | 2022-08-15 06:24:32+00:00 | Wordle 420 4/6*\n\n⬛⬛⬛🟨⬛\n⬛🟨🟨⬛⬛\n🟩🟩⬛🟩🟩\n🟩🟩🟩🟩🟩 | Natalie54056779 | Wednesfield | Twitter for Android |
| 13 | 1559053049406144512 | 2022-08-15 05:43:07+00:00 | @cougsgo @JasonPuckett20 @_MargaretLarson Word... | sddebruler | ÜT: 48.453574,-122.324751 | Twitter for Android |
| 14 | 1559023698232987649 | 2022-08-15 03:46:29+00:00 | I haven't posted a #Wordle in a little while, ... | AlexiSargeant | Manhattan, NY | Twitter Web App |
| 15 | 1559018301325512704 | 2022-08-15 03:25:02+00:00 | Current Streak:20\n\nWordle 420 6/6\n\n⬜⬜⬜⬜🟨\n... | SideMoss5126524 | Twitter for iPad | |
| 16 | 1559013629319684097 | 2022-08-15 03:06:28+00:00 | Wordle 420 4/6*\n\n⬛⬛⬛⬛⬛\n⬛🟨⬛⬛⬛\n⬛🟨⬛⬛🟨\n🟩🟩🟩🟩🟩 | AlexGodofsky | Fairfax, VA | Twitter Web App |
| 17 | 1559011999937433600 | 2022-08-15 03:00:00+00:00 | Wordle 420 X/6\n\n⬛⬛⬛⬛🟩\n⬛🟨⬛🟨⬛\n🟩🟩⬛⬛🟩\n🟩🟩⬛⬛🟩\n... | s_kwkm | http://uraraka.work/ | Twitter Web App |
| 18 | 1558998211918012416 | 2022-08-15 02:05:13+00:00 | Well, crap. A lesson not to do wordle while di... | SarahAnnMasse | LA & NYC | Twitter Web App |
| 19 | 1558980696936308736 | 2022-08-15 00:55:37+00:00 | Catch-up! There were 2 Saturday sedecordle sol... | Xanthe_Cat | Australia. Naarm, VIC 3018 | Twitter Web App |
Refer to appendix section A.4
read_csv('data/reddit/comments.csv').head(20)
| | author | body | score | created_utc |
|---|---|---|---|---|
| 0 | lavinient | Scoredle 439 4/6* \n\n14,855 \n⬛⬛🟨⬛⬛ >!SLIME... | 26 | 1662010774 |
| 1 | BlitzAceSamy | Scoredle 439 3/6* \n\n|Guess|Result|Scoredle|... | 11 | 1661943874 |
| 2 | eliw23 | Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜🟨⬜ >!STAIR... | 9 | 1662007274 |
| 3 | blackbeanqueen | Scoredle 439 3/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!SLATE... | 8 | 1662065267 |
| 4 | MidnightExcursion | Did you hear about the >!mushroom!< that got k... | 7 | 1662019442 |
| 5 | BlackJPI | Scoredle 439 4/6* \n\n14,855 \n⬜⬜⬜⬜⬜ >!STEAM... | 8 | 1662029435 |
| 6 | cheezislife | Scoredle 439 4/6* \n\n14,855 \n🟩⬛⬛⬛⬛ >!FLASH... | 6 | 1662067855 |
| 7 | HS007 | Scoredle 439 6/6 \n\n14,855 \n* ⬛🟨⬛🟨⬛ >!PIOU... | 5 | 1661972028 |
| 8 | Pinnacle55 | Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛🟨⬛ >!STONE... | 5 | 1662020156 |
| 9 | Inna_Bien | >!words that end with i are a special kind of ... | 5 | 1662022149 |
| 10 | shombled | Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛⬛⬛ >!STARE... | 4 | 1662006454 |
| 11 | LadeeAlana | People say that I'm kind of a >!FUN GUY!< myself. | 3 | 1662021326 |
| 12 | TaurAnder | Scoredle 439 3/6* \n\n14,855 \n* 🟩⬛🟨⬛⬛ >!FLI... | 4 | 1662022421 |
| 13 | cyrano4833 | Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STORE... | 3 | 1662028951 |
| 14 | oglabradoodle | Scoredle 439 4/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STEAM... | 5 | 1662040052 |
| 15 | a-little-bit-this | Scoredle 439 4/6* \n\n14,855 \n* ⬜⬜⬜🟨⬜ >!ALO... | 3 | 1662054311 |
| 16 | mohamedzu | Scoredle 439 3/6* \n\n14,855 \n🟩🟨🟨⬛⬛ >!FIGHT... | 4 | 1662055968 |
| 17 | KingD123 | Scoredle 439 3/6* \n\n12,974 \n⬛⬛⬛🟨⬛ >!PLANK... | 4 | 1662068789 |
| 18 | soleilady | Scoredle 439 4/6* \n\n12,974 \n⬜⬜🟨⬜⬜ >!ARISE... | 3 | 1662072125 |
| 19 | calscigal | Scoredle 439 3/6* \n\n12,974 \n* ⬛⬛⬛⬛⬛ >!CHA... | 4 | 1662075325 |
pd.read_csv('data/trends/trends.csv').head(20)
| | date | Wordle |
|---|---|---|
| 0 | 2021-12-26 | 0 |
| 1 | 2021-12-27 | 0 |
| 2 | 2021-12-28 | 0 |
| 3 | 2021-12-29 | 0 |
| 4 | 2021-12-30 | 0 |
| 5 | 2021-12-31 | 1 |
| 6 | 2022-01-01 | 1 |
| 7 | 2022-01-02 | 1 |
| 8 | 2022-01-03 | 1 |
| 9 | 2022-01-04 | 4 |
| 10 | 2022-01-05 | 3 |
| 11 | 2022-01-06 | 4 |
| 12 | 2022-01-07 | 6 |
| 13 | 2022-01-08 | 7 |
| 14 | 2022-01-09 | 8 |
| 15 | 2022-01-10 | 8 |
| 16 | 2022-01-11 | 10 |
| 17 | 2022-01-12 | 14 |
| 18 | 2022-01-13 | 14 |
| 19 | 2022-01-14 | 15 |
pd.read_csv('data/trends/regional.csv').head(20)
| | geoName | geoCode | Wordle |
|---|---|---|---|
| 0 | Afghanistan | AF | 0 |
| 1 | Albania | AL | 2 |
| 2 | Algeria | DZ | 0 |
| 3 | American Samoa | AS | 0 |
| 4 | Andorra | AD | 17 |
| 5 | Angola | AO | 0 |
| 6 | Anguilla | AI | 0 |
| 7 | Antarctica | AQ | 0 |
| 8 | Antigua & Barbuda | AG | 15 |
| 9 | Argentina | AR | 4 |
| 10 | Armenia | AM | 1 |
| 11 | Aruba | AW | 24 |
| 12 | Australia | AU | 55 |
| 13 | Austria | AT | 5 |
| 14 | Azerbaijan | AZ | 1 |
| 15 | Bahamas | BS | 14 |
| 16 | Bahrain | BH | 8 |
| 17 | Bangladesh | BD | 2 |
| 18 | Barbados | BB | 17 |
| 19 | Belarus | BY | 0 |
Refer to appendix section A.3.1
pd.read_csv('data/words/answers.csv', index_col='n').head(20)
| n | answers | date |
|---|---|---|
| 179 | trace | 2021-12-15 |
| 180 | using | 2021-12-16 |
| 181 | peach | 2021-12-17 |
| 182 | champ | 2021-12-18 |
| 183 | baton | 2021-12-19 |
| 184 | brake | 2021-12-20 |
| 185 | pluck | 2021-12-21 |
| 186 | craze | 2021-12-22 |
| 187 | gripe | 2021-12-23 |
| 188 | weary | 2021-12-24 |
| 189 | picky | 2021-12-25 |
| 190 | acute | 2021-12-26 |
| 191 | ferry | 2021-12-27 |
| 192 | aside | 2021-12-28 |
| 193 | tapir | 2021-12-29 |
| 194 | troll | 2021-12-30 |
| 195 | unify | 2021-12-31 |
| 196 | rebus | 2022-01-01 |
| 197 | boost | 2022-01-02 |
| 198 | truss | 2022-01-03 |
Refer to appendix section A.3.2
pd.read_csv('data/words/freq.csv', index_col=0).head(20)
| n | answers | freq | zipf |
|---|---|---|---|
| 179 | trace | 1.550000e-05 | 4.19 |
| 180 | using | 2.950000e-04 | 5.47 |
| 181 | peach | 5.890000e-06 | 3.77 |
| 182 | champ | 7.940000e-06 | 3.90 |
| 183 | baton | 4.370000e-06 | 3.64 |
| 184 | brake | 8.910000e-06 | 3.95 |
| 185 | pluck | 1.350000e-06 | 3.13 |
| 186 | craze | 2.240000e-06 | 3.35 |
| 187 | gripe | 6.460000e-07 | 2.81 |
| 188 | weary | 4.070000e-06 | 3.61 |
| 189 | picky | 2.190000e-06 | 3.34 |
| 190 | acute | 1.000000e-05 | 4.00 |
| 191 | ferry | 1.050000e-05 | 4.02 |
| 192 | aside | 3.890000e-05 | 4.59 |
| 193 | tapir | 1.950000e-07 | 2.29 |
| 194 | troll | 5.370000e-06 | 3.73 |
| 195 | unify | 1.170000e-06 | 3.07 |
| 196 | rebus | 3.390000e-07 | 2.53 |
| 197 | boost | 2.570000e-05 | 4.41 |
| 198 | truss | 1.950000e-06 | 3.29 |
pd.read_csv('data/geo/iso.csv', index_col=0).head(20)
| numeric | alpha2 | alpha3 |
|---|---|---|
| 4 | AF | AFG |
| 248 | AX | ALA |
| 8 | AL | ALB |
| 12 | DZ | DZA |
| 16 | AS | ASM |
| 20 | AD | AND |
| 24 | AO | AGO |
| 660 | AI | AIA |
| 10 | AQ | ATA |
| 28 | AG | ATG |
| 32 | AR | ARG |
| 51 | AM | ARM |
| 533 | AW | ABW |
| 36 | AU | AUS |
| 40 | AT | AUT |
| 31 | AZ | AZE |
| 44 | BS | BHS |
| 48 | BH | BHR |
| 50 | BD | BGD |
| 52 | BB | BRB |
Refer to appendix section A.2.2
https://en.wikipedia.org/wiki/List_of_alternative_country_names
pd.read_csv('data/geo/countries.csv').head(20)
| | numeric | name |
|---|---|---|
| 0 | 4 | Afghanistan |
| 1 | 24 | Angola |
| 2 | 24 | Republic of Angola |
| 3 | 24 | República de Angola |
| 4 | 8 | Albania |
| 5 | 8 | Republic of Albania |
| 6 | 8 | Republika e Shqipërisë |
| 7 | 20 | Andorra |
| 8 | 20 | Principality of Andorra |
| 9 | 20 | Principat d'Andorra |
| 10 | 784 | Al Emirat al Arabbiya al Muttahida |
| 11 | 784 | The Emirates |
| 12 | 784 | Trucial Arabia |
| 13 | 784 | United Arab Emirates |
| 14 | 32 | Argentina |
| 15 | 32 | Argentine Nation |
| 16 | 32 | Argentine Republic |
| 17 | 32 | la Argentina |
| 18 | 51 | Armenia |
| 19 | 51 | Hayastani Hanrapetut’yun |
Refer to appendix section A.2.3
pd.read_csv('data/geo/flags.csv').head(20)
| | numeric | flag |
|---|---|---|
| 0 | 4 | 🇦🇫 |
| 1 | 248 | 🇦🇽 |
| 2 | 8 | 🇦🇱 |
| 3 | 12 | 🇩🇿 |
| 4 | 16 | 🇦🇸 |
| 5 | 20 | 🇦🇩 |
| 6 | 24 | 🇦🇴 |
| 7 | 660 | 🇦🇮 |
| 8 | 10 | 🇦🇶 |
| 9 | 28 | 🇦🇬 |
| 10 | 32 | 🇦🇷 |
| 11 | 51 | 🇦🇲 |
| 12 | 533 | 🇦🇼 |
| 13 | 36 | 🇦🇺 |
| 14 | 40 | 🇦🇹 |
| 15 | 31 | 🇦🇿 |
| 16 | 44 | 🇧🇸 |
| 17 | 48 | 🇧🇭 |
| 18 | 50 | 🇧🇩 |
| 19 | 52 | 🇧🇧 |
https://ourworldindata.org/grapher/number-of-internet-users-by-country
pd.read_csv('data/geo/internet_users.csv').head(20)
| | Entity | Code | Year | Number of internet users (OWID based on WB & UN) |
|---|---|---|---|---|
| 0 | Afghanistan | AFG | 1990 | 0 |
| 1 | Afghanistan | AFG | 2001 | 990 |
| 2 | Afghanistan | AFG | 2002 | 1003 |
| 3 | Afghanistan | AFG | 2003 | 20272 |
| 4 | Afghanistan | AFG | 2004 | 25520 |
| 5 | Afghanistan | AFG | 2005 | 306904 |
| 6 | Afghanistan | AFG | 2006 | 545607 |
| 7 | Afghanistan | AFG | 2007 | 505719 |
| 8 | Afghanistan | AFG | 2008 | 502210 |
| 9 | Afghanistan | AFG | 2009 | 994154 |
| 10 | Afghanistan | AFG | 2010 | 1152127 |
| 11 | Afghanistan | AFG | 2011 | 1485430 |
| 12 | Afghanistan | AFG | 2012 | 1674380 |
| 13 | Afghanistan | AFG | 2013 | 1872170 |
| 14 | Afghanistan | AFG | 2014 | 2293061 |
| 15 | Afghanistan | AFG | 2015 | 2786634 |
| 16 | Afghanistan | AFG | 2016 | 3672058 |
| 17 | Albania | ALB | 1990 | 0 |
| 18 | Albania | ALB | 1995 | 347 |
| 19 | Albania | ALB | 1996 | 996 |
pd.read_csv('data/geo/subcountries.csv').head(20)
| | name | country | subcountry | geonameid |
|---|---|---|---|---|
| 0 | les Escaldes | Andorra | Escaldes-Engordany | 3040051 |
| 1 | Andorra la Vella | Andorra | Andorra la Vella | 3041563 |
| 2 | Umm al Qaywayn | United Arab Emirates | Umm al Qaywayn | 290594 |
| 3 | Ras al-Khaimah | United Arab Emirates | Raʼs al Khaymah | 291074 |
| 4 | Khawr Fakkān | United Arab Emirates | Ash Shāriqah | 291696 |
| 5 | Dubai | United Arab Emirates | Dubai | 292223 |
| 6 | Dibba Al-Fujairah | United Arab Emirates | Al Fujayrah | 292231 |
| 7 | Dibba Al-Hisn | United Arab Emirates | Al Fujayrah | 292239 |
| 8 | Sharjah | United Arab Emirates | Ash Shāriqah | 292672 |
| 9 | Ar Ruways | United Arab Emirates | Abu Dhabi | 292688 |
| 10 | Al Fujayrah | United Arab Emirates | Al Fujayrah | 292878 |
| 11 | Al Ain | United Arab Emirates | Abu Dhabi | 292913 |
| 12 | Ajman | United Arab Emirates | Ajman | 292932 |
| 13 | Adh Dhayd | United Arab Emirates | Ash Shāriqah | 292953 |
| 14 | Abu Dhabi | United Arab Emirates | Abu Dhabi | 292968 |
| 15 | Zaranj | Afghanistan | Nīmrūz | 1120985 |
| 16 | Taloqan | Afghanistan | Takhār | 1123004 |
| 17 | Shīnḏanḏ | Afghanistan | Herat | 1125155 |
| 18 | Shibirghān | Afghanistan | Jowzjān | 1125444 |
| 19 | Shahrak | Afghanistan | Ghowr | 1125896 |
pd.read_csv('data/geo/cities.csv').head(20)
| | rank | Name | Country | Population | Prev | Growth |
|---|---|---|---|---|---|---|
| 0 | 1 | Tokyo | Japan | 37274000 | 37339804 | -0.0018 |
| 1 | 2 | Delhi | India | 32065760 | 31181376 | 0.0284 |
| 2 | 3 | Shanghai | China | 28516904 | 27795702 | 0.0259 |
| 3 | 4 | Dhaka | Bangladesh | 22478116 | 21741090 | 0.0339 |
| 4 | 5 | Sao Paulo | Brazil | 22429800 | 22237472 | 0.0086 |
| 5 | 6 | Mexico City | Mexico | 22085140 | 21918936 | 0.0076 |
| 6 | 7 | Cairo | Egypt | 21750020 | 21322750 | 0.0200 |
| 7 | 8 | Beijing | China | 21333332 | 20896820 | 0.0209 |
| 8 | 9 | Mumbai | India | 20961472 | 20667656 | 0.0142 |
| 9 | 10 | Osaka | Japan | 19059856 | 19110616 | -0.0027 |
| 10 | 11 | Chongqing | China | 16874740 | 16382376 | 0.0301 |
| 11 | 12 | Karachi | Pakistan | 16839950 | 16459472 | 0.0231 |
| 12 | 13 | Istanbul | Turkey | 15636243 | 15415197 | 0.0143 |
| 13 | 14 | Kinshasa | DR Congo | 15628085 | 14970460 | 0.0439 |
| 14 | 15 | Lagos | Nigeria | 15387639 | 14862111 | 0.0354 |
| 15 | 16 | Buenos Aires | Argentina | 15369919 | 15257673 | 0.0074 |
| 16 | 17 | Kolkata | India | 15133888 | 14974073 | 0.0107 |
| 17 | 18 | Manila | Philippines | 14406059 | 14158573 | 0.0175 |
| 18 | 19 | Tianjin | China | 14011828 | 13794450 | 0.0158 |
| 19 | 20 | Guangzhou | China | 13964637 | 13635397 | 0.0241 |
https://worldpopulationreview.com/states/state-abbreviations
pd.read_csv('data/geo/states.csv').head(20)
| | State | Abbrev | Code |
|---|---|---|---|
| 0 | Alabama | Ala. | AL |
| 1 | Alaska | Alaska | AK |
| 2 | Arizona | Ariz. | AZ |
| 3 | Arkansas | Ark. | AR |
| 4 | California | Calif. | CA |
| 5 | Colorado | Colo. | CO |
| 6 | Connecticut | Conn. | CT |
| 7 | Delaware | Del. | DE |
| 8 | District of Columbia | D.C. | DC |
| 9 | Florida | Fla. | FL |
| 10 | Georgia | Ga. | GA |
| 11 | Hawaii | Hawaii | HI |
| 12 | Idaho | Idaho | ID |
| 13 | Illinois | Ill. | IL |
| 14 | Indiana | Ind. | IN |
| 15 | Iowa | Iowa | IA |
| 16 | Kansas | Kans. | KS |
| 17 | Kentucky | Ky. | KY |
| 18 | Louisiana | La. | LA |
| 19 | Maine | Maine | ME |
import re
When a game of Wordle is completed, the player can easily share their result by copying a grid of emoji squares generated by the game. This lets players share their game with others without spoiling the answer. These cryptic emoji grids have been shared all over the internet, especially on Twitter, as seen in the tweets below.
Wordle 420 X/6
— 🎃Hogueus Woahcus 🎃 (@HogueLikeWoah) August 14, 2022
⬛🟧⬛⬛⬛
⬛🟧⬛⬛⬛
⬛🟧⬛🟦🟧
🟧🟧⬛⬛🟧
🟧🟧⬛⬛🟧
🟧🟧⬛🟧🟧
this was so rude
Again with the rhyming game. I lucked out this time!#Wordle420 4/6*
— Gayla (@AZGayla) August 13, 2022
⬛⬛⬛🟨⬛
⬛⬛🟩⬛🟩
⬛🟩🟩🟩🟩
🟩🟩🟩🟩🟩 pic.twitter.com/cySnOsjYSg
The shared text follows a standard structure: a header displaying the game number and score, followed by the emoji grid. Each row represents one guess made by the user, and the emojis show how close that guess was. The header also includes an asterisk when the game was played on hard mode, and the emoji colours depend on whether the user uses dark or light theme, and whether they have high-contrast mode on.
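As a minimal illustration of the header format on its own (a simplified sketch, separate from the full extraction regex in the processing code), the game number, score, and hard-mode flag can be pulled out with a short pattern; the group names here are just illustrative:

```python
import re

# simplified header pattern: game number, score (X = failed), optional hard-mode asterisk
header_re = re.compile(r'Wordle (?P<wordle>\d+) (?P<score>[1-6X])/6(?P<hard>\*)?')

m = header_re.search('Wordle 420 4/6*')
print(m['wordle'], m['score'], m['hard'] is not None)  # 420 4 True
```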
# process 1 day of tweets and write to file
def process_twitter(n):
    posts = read_csv(f'data/twitter/raw/wordles{n}.csv')  # load data
    # regex to match shared wordle games
    regex = ''.join([
        # match header
        rf'Wordle[^\n]*(?P<wordle>{n})[^\n]+(?P<score>[0-6X])\/6(?P<hard>\*)?',
        # negative lookahead to reject links, as many wordle "clones" link to their website
        r'(?![^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*wordle[\w\-\.]*\.[a-z]{2,})',
        r'[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*',
        # emoji square grid
        r'(?P<grid>(?:[^🟥🟧🟨🟩🟦🟪⬛⬜🟫\n]*[🟨🟩🟧🟦⬛⬜]{5}[^🟥🟧🟨🟩🟦🟪⬛⬜🟫\n]*\n?){1,6})',
        r'(?!\n?[🟥🟧🟨🟩🟦🟪⬛⬜🟫])',
        # negative lookahead to reject trailing links
        r'(?![^🟥🟧🟨🟩🟦🟪⬛⬜🟫\/]*wordle[\w\-\.]*\.[a-z]{2,})'
    ])
    # remove non-standard characters to allow proper matching,
    # such as invisible characters like U+FE0F "VARIATION SELECTOR-16"
    # which often follow some emojis
    scores = posts['content'].str.replace(
        r'[^\w\s\n\/\.\-\*🟥🟧🟨🟩🟦🟪⬛⬜🟫]', '', regex=True
    ).str.extract(regex, flags=re.IGNORECASE)  # run regex extraction
    # return early if there are no matches in the tweets
    if scores.isna().all().all():
        return
    # copy over data from the raw posts to the regex-extracted data
    scores[['location', 'source']] = posts[['userLocation', 'sourceLabel']].astype('string')
    scores['username'] = posts['username'].astype('string')
    # remove non-matches
    scores = scores.dropna(subset='wordle').reset_index(drop=True)
    # type conversion to save memory
    scores['wordle'] = scores['wordle'].astype('int16')
    scores['score'] = scores['score'].replace(['x', 'X'], 0).astype('int8')
    scores['hard'] = scores['hard'] == '*'
    scores['dark'] = ~scores['grid'].str.contains('⬜')
    scores['contrast'] = ~scores['grid'].str.contains('🟩')
    grid = scores[scores['grid'].notnull()]['grid'].str.split(
        '[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]+', expand=True, n=6  # separate each row of emojis
    ).replace('', pd.NA).reindex(  # convert all blank values to <NA>
        columns=range(0, 6)
    ).astype('string')
    # each emoji row is encoded as a base-3 number to reduce memory
    emojis = {
        '⬛': 0, '⬜': 0,  # absent letter
        '🟦': 1, '🟨': 1,  # present letter
        '🟧': 2, '🟩': 2   # correct letter
    }
    # convert one emoji row to base 3, offset by 1 so that 0 means "no guess"
    def encodeGuess(guess):
        ans = 0
        if pd.notna(guess):
            for n in map(emojis.get, guess):
                ans = ans * 3 + n
            ans += 1
        return ans
    # encode each column of guesses
    def encodeColumn(col):
        # normalise non-standard emoji use: in rows that also contain 🟩,
        # treat 🟧 as 🟨 before encoding
        mask = col.str.contains('🟩').fillna(False)
        col = col.where(~mask, col.str.replace('🟧', '🟨', regex=False))
        return col.apply(encodeGuess)
    grid = grid.apply(encodeColumn).astype('uint8')
    scores = pd.concat([scores, grid], axis=1).drop(columns='grid')
    # stop pyarrow from complaining about mixed-type column names
    scores.columns = scores.columns.astype(str)
    # save to csv
    write_csv(scores, f'data/twitter/processed/wordles{n}.csv')
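The base-3 encoding can be sanity-checked in isolation. Each square maps to a digit (0 absent, 1 present, 2 correct), the row is read as a base-3 number, and 1 is added so that 0 can mean "no guess"; an all-green winning row therefore encodes to 242 + 1 = 243, the value that appears as the final guess throughout the processed tables. A standalone sketch of the same logic:

```python
# same digit mapping as the processing function above
EMOJI_DIGIT = {'⬛': 0, '⬜': 0, '🟦': 1, '🟨': 1, '🟧': 2, '🟩': 2}

def encode_guess(row):
    """Encode a 5-square row as a base-3 number, offset by 1 (0 = no guess)."""
    ans = 0
    for square in row:
        ans = ans * 3 + EMOJI_DIGIT[square]
    return ans + 1

print(encode_guess('🟩🟩🟩🟩🟩'))  # 243 — the winning row
print(encode_guess('⬛⬛⬛⬛⬛'))  # 1 — no letters matched
```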
geo_iso = pd.read_csv('data/geo/iso.csv', index_col=0)
geo_iso.head(20)
| numeric | alpha2 | alpha3 |
|---|---|---|
| 4 | AF | AFG |
| 248 | AX | ALA |
| 8 | AL | ALB |
| 12 | DZ | DZA |
| 16 | AS | ASM |
| 20 | AD | AND |
| 24 | AO | AGO |
| 660 | AI | AIA |
| 10 | AQ | ATA |
| 28 | AG | ATG |
| 32 | AR | ARG |
| 51 | AM | ARM |
| 533 | AW | ABW |
| 36 | AU | AUS |
| 40 | AT | AUT |
| 31 | AZ | AZE |
| 44 | BS | BHS |
| 48 | BH | BHR |
| 50 | BD | BGD |
| 52 | BB | BRB |
geo_internet = pd.read_csv('data/geo/internet_users.csv')
# Reduce dataset to 1 value per country
geo_internet = geo_internet.groupby('Code').max()[
['Entity','Number of internet users (OWID based on WB & UN)']
]
# Convert alpha3 code to numeric code
geo_internet = pd.merge(
geo_iso['alpha3'].reset_index(),
geo_internet,
left_on='alpha3',
right_index=True,
how='right'
).drop('alpha3', axis=1).reset_index(drop=True)
# Rename columns
geo_internet.columns = ['numeric','country', 'users']
geo_internet.info()
geo_internet[geo_internet['numeric'].isnull()]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   numeric  191 non-null    float64
 1   country  192 non-null    object
 2   users    192 non-null    int64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.6+ KB
| | numeric | country | users |
|---|---|---|---|
| 133 | NaN | World | 3419398061 |
# Drop 'World' as it is not a country
geo_internet.dropna(inplace=True)
# Downcast the numeric code (from float64) to int16
geo_internet['numeric'] = pd.to_numeric(geo_internet['numeric'], downcast='integer')
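As a small sanity check of the downcasting step (assuming the usual pandas behaviour), `pd.to_numeric(..., downcast='integer')` picks the smallest integer type that can hold all the values, so float country codes up to the 800s fit in an int16; the values here are illustrative:

```python
import pandas as pd

# float64, as produced when a merge introduces NaNs into an integer column
codes = pd.Series([4.0, 533.0, 840.0])
codes = pd.to_numeric(codes, downcast='integer')

# int8 tops out at 127, so the smallest type that holds 840 is int16
print(codes.dtype)  # int16
```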
geo_internet.info()
geo_internet.head(20)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 191 entries, 0 to 191
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   numeric  191 non-null    int16
 1   country  191 non-null    object
 2   users    191 non-null    int64
dtypes: int16(1), int64(1), object(1)
memory usage: 4.8+ KB
| | numeric | country | users |
|---|---|---|---|
| 0 | 533 | Aruba | 98053 |
| 1 | 4 | Afghanistan | 3672058 |
| 2 | 24 | Angola | 3745750 |
| 3 | 8 | Albania | 1942025 |
| 4 | 784 | United Arab Emirates | 8913210 |
| 5 | 32 | Argentina | 31118074 |
| 6 | 51 | Armenia | 1882003 |
| 7 | 28 | Antigua and Barbuda | 73703 |
| 8 | 36 | Australia | 21288325 |
| 9 | 40 | Austria | 7681572 |
| 10 | 31 | Azerbaijan | 7763795 |
| 11 | 108 | Burundi | 544479 |
| 12 | 56 | Belgium | 10021205 |
| 13 | 204 | Benin | 1304029 |
| 14 | 854 | Burkina Faso | 2603042 |
| 15 | 50 | Bangladesh | 29733670 |
| 16 | 100 | Bulgaria | 4492334 |
| 17 | 48 | Bahrain | 1431062 |
| 18 | 44 | Bahamas | 312986 |
| 19 | 70 | Bosnia and Herzegovina | 2437067 |
geo_countries = pd.read_csv('data/geo/countries.csv')
# Sort country names by number of internet users
geo_countries = geo_countries.merge(
# Combine with geo_internet to do sort
geo_internet[['numeric','users']],
on='numeric',
how='left'
).fillna(0).sort_values( # do sort
'users', ascending=False
).drop('users', axis=1).reset_index(drop=True)
geo_countries.head(20)
| | numeric | name |
|---|---|---|
| 0 | 156 | 中华人民共和国 |
| 1 | 156 | Mainland China |
| 2 | 156 | Red China |
| 3 | 156 | People's Republic of China |
| 4 | 156 | PRC |
| 5 | 156 | 中国內地 |
| 6 | 156 | 中国大陆 |
| 7 | 156 | 中国 |
| 8 | 156 | New China |
| 9 | 156 | Communist China |
| 10 | 156 | China |
| 11 | 156 | 新中国 |
| 12 | 156 | 中共 |
| 13 | 356 | भारतवर्ष |
| 14 | 356 | ഭാരതം |
| 15 | 356 | भारत |
| 16 | 356 | Republic of India |
| 17 | 356 | India |
| 18 | 356 | பாரதம் |
| 19 | 356 | Bhāratam भारतम् |
geo_flags = pd.read_csv('data/geo/flags.csv')
# Sort flag emojis by number of internet users
# same process as for country names
geo_flags = geo_flags.merge(
geo_internet[['numeric','users']],
on='numeric', how='left'
).fillna(0).sort_values(
'users', ascending=False
).drop(
'users', axis=1
).reset_index(drop=True)
geo_flags.head(20)
| | numeric | flag |
|---|---|---|
| 0 | 156 | 🇨🇳 |
| 1 | 356 | 🇮🇳 |
| 2 | 840 | 🇺🇸 |
| 3 | 76 | 🇧🇷 |
| 4 | 392 | 🇯🇵 |
| 5 | 643 | 🇷🇺 |
| 6 | 360 | 🇮🇩 |
| 7 | 484 | 🇲🇽 |
| 8 | 276 | 🇩🇪 |
| 9 | 826 | 🇬🇧 |
| 10 | 608 | 🇵🇭 |
| 11 | 250 | 🇫🇷 |
| 12 | 792 | 🇹🇷 |
| 13 | 364 | 🇮🇷 |
| 14 | 410 | 🇰🇷 |
| 15 | 566 | 🇳🇬 |
| 16 | 704 | 🇻🇳 |
| 17 | 818 | 🇪🇬 |
| 18 | 724 | 🇪🇸 |
| 19 | 764 | 🇹🇭 |
geo_subcountries = pd.read_csv('data/geo/subcountries.csv').dropna()
geo_subcountries = pd.DataFrame(
# Get all unique pairs of country and subcountry
list(set(zip(
geo_subcountries['country'],
geo_subcountries['subcountry']
))),
columns=['country','subcountry']
).merge(
# Get numeric code from geo_countries
geo_countries, left_on='country', right_on='name'
).drop('name',axis=1)
geo_subcountries.head(20)
| | country | subcountry | numeric |
|---|---|---|---|
| 0 | Russia | Tula | 643 |
| 1 | Russia | Kabardino-Balkariya | 643 |
| 2 | Russia | Kamtsjatka | 643 |
| 3 | Russia | Tatarstan | 643 |
| 4 | Russia | Vologda | 643 |
| 5 | Russia | Arkhangelskaya | 643 |
| 6 | Russia | Brjansk | 643 |
| 7 | Russia | Lipetsk | 643 |
| 8 | Russia | Rostov | 643 |
| 9 | Russia | Smolensk | 643 |
| 10 | Russia | Nenetskiy Avtonomnyy Okrug | 643 |
| 11 | Russia | Karachayevo-Cherkesiya | 643 |
| 12 | Russia | Sverdlovsk | 643 |
| 13 | Russia | Kursk | 643 |
| 14 | Russia | Omsk | 643 |
| 15 | Russia | Stavropol'skiy | 643 |
| 16 | Russia | Respublika Buryatiya | 643 |
| 17 | Russia | Jaroslavl | 643 |
| 18 | Russia | Mariy-El | 643 |
| 19 | Russia | Krasnodarskiy | 643 |
geo_cities = pd.read_csv('data/geo/cities.csv')
geo_cities.loc[geo_cities['Country']=='DR Congo', 'Country'] = 'Congo' # Special case
geo_cities = geo_cities.merge(
# Get numeric code from geo_countries
geo_countries, left_on='Country', right_on='name', how='inner'
).drop('name',axis=1).sort_values(
# Sort by population size
by='Population', ascending=False
# drop NA values, such as for regions which are not countries
).reset_index(drop=True).dropna()
geo_cities.head(20)
| | rank | Name | Country | Population | Prev | Growth | numeric |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Tokyo | Japan | 37274000 | 37339804 | -0.0018 | 392 |
| 1 | 2 | Delhi | India | 32065760 | 31181376 | 0.0284 | 356 |
| 2 | 3 | Shanghai | China | 28516904 | 27795702 | 0.0259 | 156 |
| 3 | 4 | Dhaka | Bangladesh | 22478116 | 21741090 | 0.0339 | 50 |
| 4 | 5 | Sao Paulo | Brazil | 22429800 | 22237472 | 0.0086 | 76 |
| 5 | 6 | Mexico City | Mexico | 22085140 | 21918936 | 0.0076 | 484 |
| 6 | 7 | Cairo | Egypt | 21750020 | 21322750 | 0.0200 | 818 |
| 7 | 8 | Beijing | China | 21333332 | 20896820 | 0.0209 | 156 |
| 8 | 9 | Mumbai | India | 20961472 | 20667656 | 0.0142 | 356 |
| 9 | 10 | Osaka | Japan | 19059856 | 19110616 | -0.0027 | 392 |
| 10 | 11 | Chongqing | China | 16874740 | 16382376 | 0.0301 | 156 |
| 11 | 12 | Karachi | Pakistan | 16839950 | 16459472 | 0.0231 | 586 |
| 12 | 13 | Istanbul | Turkey | 15636243 | 15415197 | 0.0143 | 792 |
| 13 | 14 | Kinshasa | Congo | 15628085 | 14970460 | 0.0439 | 178 |
| 14 | 14 | Kinshasa | Congo | 15628085 | 14970460 | 0.0439 | 178 |
| 15 | 15 | Lagos | Nigeria | 15387639 | 14862111 | 0.0354 | 566 |
| 16 | 16 | Buenos Aires | Argentina | 15369919 | 15257673 | 0.0074 | 32 |
| 17 | 17 | Kolkata | India | 15133888 | 14974073 | 0.0107 | 356 |
| 18 | 18 | Manila | Philippines | 14406059 | 14158573 | 0.0175 | 608 |
| 19 | 19 | Tianjin | China | 14011828 | 13794450 | 0.0158 | 156 |
geo_states = pd.read_csv('data/geo/states.csv')
geo_states.head(20)
| | State | Abbrev | Code |
|---|---|---|---|
| 0 | Alabama | Ala. | AL |
| 1 | Alaska | Alaska | AK |
| 2 | Arizona | Ariz. | AZ |
| 3 | Arkansas | Ark. | AR |
| 4 | California | Calif. | CA |
| 5 | Colorado | Colo. | CO |
| 6 | Connecticut | Conn. | CT |
| 7 | Delaware | Del. | DE |
| 8 | District of Columbia | D.C. | DC |
| 9 | Florida | Fla. | FL |
| 10 | Georgia | Ga. | GA |
| 11 | Hawaii | Hawaii | HI |
| 12 | Idaho | Idaho | ID |
| 13 | Illinois | Ill. | IL |
| 14 | Indiana | Ind. | IN |
| 15 | Iowa | Iowa | IA |
| 16 | Kansas | Kans. | KS |
| 17 | Kentucky | Ky. | KY |
| 18 | Louisiana | La. | LA |
| 19 | Maine | Maine | ME |
The location of Twitter users can be inferred from the location field of their profile. However, not all users opt to fill it in, and since it is a free-form text field, many entries are not real locations at all, which makes determining a user's location accurately much harder.
dtypes = pd.read_csv('data/twitter/processed/dtypes.csv', index_col=0).iloc[:,0].to_dict()
df_twitter = pyarrow.concat_tables([
    arcsv.read_csv(
        f'data/twitter/processed/wordles{n}.csv',
        parse_options=arcsv.ParseOptions(newlines_in_values=True),
        convert_options=arcsv.ConvertOptions(column_types=dtypes)
    ) for n in range(190, 440)
]).to_pandas()
# encode tweet sources as integer codes, ordered by frequency
sources = pd.Series(df_twitter['source'].value_counts(sort=True).index)
df_twitter['source'] = pd.to_numeric(
    df_twitter['source'].map(pd.Series(sources.index, index=sources)),
    downcast='integer'
)
df_twitter.head(20)
| | wordle | score | hard | location | source | username | dark | contrast | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 190 | 4 | False | 1 | cafehearts | True | False | 91 | 33 | 166 | 243 | 0 | 0 | |
| 1 | 190 | 4 | False | New Zealand | 0 | MJMcVeigh | False | False | 40 | 112 | 220 | 243 | 0 | 0 |
| 2 | 190 | 4 | False | lenape territory / nyc | 1 | jmarieray | False | False | 121 | 94 | 229 | 243 | 0 | 0 |
| 3 | 190 | 5 | False | Seaton Carew, England | 4 | Olucaron | False | False | 91 | 41 | 121 | 229 | 243 | 0 |
| 4 | 191 | 2 | False | Wellington, New Zealand | 1 | BarristerNZ | True | False | 27 | 243 | 0 | 0 | 0 | 0 |
| 5 | 191 | 5 | False | 💉💉💉 on Ngunnawal country | 1 | KBCanB | False | False | 7 | 90 | 81 | 81 | 243 | 0 |
| 6 | 191 | 6 | False | New Zealand | 0 | MJMcVeigh | False | False | 136 | 56 | 73 | 237 | 237 | 243 |
| 7 | 191 | 4 | False | 1 | cafehearts | True | False | 8 | 1 | 180 | 243 | 0 | 0 | |
| 8 | 192 | 3 | False | Dallas, TX | 1 | KevinPayravi | True | False | 12 | 165 | 243 | 0 | 0 | 0 |
| 9 | 192 | 4 | False | 방 의자 | 2 | onezero42O | False | False | 109 | 12 | 222 | 243 | 0 | 0 |
| 10 | 192 | 5 | False | Canberra | 12 | Chriplodocus | True | False | 38 | 94 | 41 | 168 | 243 | 0 |
| 11 | 192 | 0 | False | 1 | aquaebn | False | False | 2 | 32 | 8 | 32 | 6 | 1 | |
| 12 | 192 | 3 | False | Auckland Central, Auckland | 3 | kathyscott25 | False | False | 94 | 53 | 243 | 0 | 0 | 0 |
| 13 | 192 | 5 | False | 1 | alejandroechev_ | False | False | 28 | 94 | 220 | 220 | 243 | 0 | |
| 14 | 192 | 5 | False | Santiago, Chile | 0 | jpalz | True | False | 32 | 13 | 109 | 220 | 243 | 0 |
| 15 | 192 | 4 | False | San Diego | 2 | badalj | False | False | 10 | 47 | 21 | 243 | 0 | 0 |
| 16 | 192 | 3 | False | crossing a bridge | 2 | semirose | False | False | 102 | 24 | 243 | 0 | 0 | 0 |
| 17 | 192 | 5 | False | australia • they/them • bi | 0 | eboyuchiha | True | False | 93 | 193 | 194 | 189 | 243 | 0 |
| 18 | 192 | 4 | False | 065 067 066 | 1 | CanDoMarxist | True | False | 30 | 183 | 186 | 243 | 0 | 0 |
| 19 | 192 | 2 | False | Victoria, Australia | 1 | crazythainame | False | False | 93 | 243 | 0 | 0 | 0 | 0 |
# Get all unique locations of twitter users
twt_locations = pd.DataFrame(df_twitter['location'].dropna().str.strip().unique(), columns=['location'])
twt_locations.head(20)
| | location |
|---|---|
| 0 | |
| 1 | New Zealand |
| 2 | lenape territory / nyc |
| 3 | Seaton Carew, England |
| 4 | Wellington, New Zealand |
| 5 | 💉💉💉 on Ngunnawal country |
| 6 | Dallas, TX |
| 7 | 방 의자 |
| 8 | Canberra |
| 9 | Auckland Central, Auckland |
| 10 | Santiago, Chile |
| 11 | San Diego |
| 12 | crossing a bridge |
| 13 | australia • they/them • bi |
| 14 | 065 067 066 |
| 15 | Victoria, Australia |
| 16 | A couple of steps off the pace |
| 17 | Warrnambool, Victoria |
| 18 | Hobart, Tasmania |
| 19 | Melbourne, Australia |
# combine all the dataframes used to match terms to countries
geo_names = pd.concat([
geo_countries,
# renaming columns for consistency & concat
geo_subcountries.rename(columns={'subcountry':'name'})[['numeric','name']],
geo_cities.rename(columns={'Name':'name'})[['numeric','name']]
])
geo_names['name'] = geo_names['name'].str.lower()
geo_names.head(20)
| | numeric | name |
|---|---|---|
| 0 | 156 | 中华人民共和国 |
| 1 | 156 | mainland china |
| 2 | 156 | red china |
| 3 | 156 | people's republic of china |
| 4 | 156 | prc |
| 5 | 156 | 中国內地 |
| 6 | 156 | 中国大陆 |
| 7 | 156 | 中国 |
| 8 | 156 | new china |
| 9 | 156 | communist china |
| 10 | 156 | china |
| 11 | 156 | 新中国 |
| 12 | 156 | 中共 |
| 13 | 356 | भारतवर्ष |
| 14 | 356 | ഭാരതം |
| 15 | 356 | भारत |
| 16 | 356 | republic of india |
| 17 | 356 | india |
| 18 | 356 | பாரதம் |
| 19 | 356 | bhāratam भारतम् |
import re

# putting this in a function as it can be quite slow
def run_location_matching():
memo = {}
# try to match location string to country numeric code
def match_location(location):
if location in memo: # should be redundant as values should be unique
return memo[location]
# try to match flags first
for country, flag in zip(geo_flags['numeric'], geo_flags['flag']):
if flag in location:
memo[location] = country
return country
# match names/terms to country
for country, name in zip(geo_names['numeric'], geo_names['name']):
if name in location and re.search(rf'(^|[^a-z]){re.escape(name)}([^a-z]|$)', location, re.I):
memo[location] = country
return country
# no match
memo[location] = 0
return 0
# run matching for all the locations
twt_locations['country'] = twt_locations['location'].str.lower().apply(match_location)
# matching by US state codes
# boolean mask for locations with unmatched country
mask = twt_locations['country']==0
unmatched = twt_locations.loc[
mask,'location'
].str.replace(
'.','', regex=False
)
# regex for matching any state code
regex = "|".join(geo_states["Code"].apply(re.escape))
# set country as USA if matches state code
twt_locations.loc[
mask &
unmatched.str.contains(
rf'(?:^|[^a-zA-Z])(?:{regex})(?:[^a-zA-Z]|$)'
) &
unmatched.str.contains(
rf', ?(?:{regex})(?:[^a-z]|$)',
case=False
),
'country'
] = 840 # USA
# map matched locations (twt_location) to each tweet in df_twitter
df_twitter['location'] = df_twitter['location'].str.strip()
df_twitter = pd.merge(
df_twitter, twt_locations, on='location'
).drop('location', axis=1) # remove location field after matching
# downcast to int
df_twitter['country'] = pd.to_numeric(df_twitter['country'], downcast='integer')
# probably redundant
df_twitter['username'] = df_twitter['username'].astype('string')
# write to csv with pyarrow
arcsv.write_csv(Table.from_pandas(df_twitter, preserve_index=False), 'data/twitter/wordle.csv')
# save data types
df_twitter.dtypes.to_csv('data/twitter/dtypes.csv')
Another platform where users share their results is Reddit, mainly in the Daily Wordle threads on the r/wordle subreddit. Reddit's richer text formatting also lets users share their guess words using spoiler tags (`>!SPOILER!<`). As such, the sharing format is similar to the standard format, but with each guessed word in spoiler tags, usually on the same line as its row of emojis.
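As a minimal sketch of this format (using a made-up comment, not one from the dataset, and simpler regexes than the combined extraction pattern used below), the header and the first spoiler word can be pulled out like so:

```python
import re

# Hypothetical Reddit comment in the sharing format described above
sample = "Scoredle 439 4/6*\n\n14,855\n⬛⬛🟨⬛⬛ >!SLIME!<"

# Header: wordle number, score, and optional hard-mode asterisk
header = re.search(r'(?:Score|Wor)dle (\d+) ([0-6X])/6(\*)?', sample)
# First guessed word inside spoiler tags
spoiler = re.search(r'>!([A-Za-z]{5})!<', sample)

print(header.group(1), header.group(2), bool(header.group(3)), spoiler.group(1))
```

The real extraction below does the same thing in a single pattern with named groups, applied to every comment via `str.extract`.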
reddit_raw = read_csv('data/reddit/comments.csv')
reddit_raw.info()
reddit_raw.head(20)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 59614 entries, 0 to 59613 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 author 59614 non-null object 1 body 59614 non-null object 2 score 59614 non-null int64 3 created_utc 59614 non-null int64 dtypes: int64(2), object(2) memory usage: 1.8+ MB
| author | body | score | created_utc | |
|---|---|---|---|---|
| 0 | lavinient | Scoredle 439 4/6* \n\n14,855 \n⬛⬛🟨⬛⬛ >!SLIME... | 26 | 1662010774 |
| 1 | BlitzAceSamy | Scoredle 439 3/6* \n\n|Guess|Result|Scoredle|... | 11 | 1661943874 |
| 2 | eliw23 | Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜🟨⬜ >!STAIR... | 9 | 1662007274 |
| 3 | blackbeanqueen | Scoredle 439 3/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!SLATE... | 8 | 1662065267 |
| 4 | MidnightExcursion | Did you hear about the >!mushroom!< that got k... | 7 | 1662019442 |
| 5 | BlackJPI | Scoredle 439 4/6* \n\n14,855 \n⬜⬜⬜⬜⬜ >!STEAM... | 8 | 1662029435 |
| 6 | cheezislife | Scoredle 439 4/6* \n\n14,855 \n🟩⬛⬛⬛⬛ >!FLASH... | 6 | 1662067855 |
| 7 | HS007 | Scoredle 439 6/6 \n\n14,855 \n* ⬛🟨⬛🟨⬛ >!PIOU... | 5 | 1661972028 |
| 8 | Pinnacle55 | Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛🟨⬛ >!STONE... | 5 | 1662020156 |
| 9 | Inna_Bien | >!words that end with i are a special kind of ... | 5 | 1662022149 |
| 10 | shombled | Scoredle 439 3/6* \n\n14,855 \n⬛⬛⬛⬛⬛ >!STARE... | 4 | 1662006454 |
| 11 | LadeeAlana | People say that I'm kind of a >!FUN GUY!< myself. | 3 | 1662021326 |
| 12 | TaurAnder | Scoredle 439 3/6* \n\n14,855 \n* 🟩⬛🟨⬛⬛ >!FLI... | 4 | 1662022421 |
| 13 | cyrano4833 | Scoredle 439 5/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STORE... | 3 | 1662028951 |
| 14 | oglabradoodle | Scoredle 439 4/6* \n\n12,974 \n⬜⬜⬜⬜⬜ >!STEAM... | 5 | 1662040052 |
| 15 | a-little-bit-this | Scoredle 439 4/6* \n\n14,855 \n* ⬜⬜⬜🟨⬜ >!ALO... | 3 | 1662054311 |
| 16 | mohamedzu | Scoredle 439 3/6* \n\n14,855 \n🟩🟨🟨⬛⬛ >!FIGHT... | 4 | 1662055968 |
| 17 | KingD123 | Scoredle 439 3/6* \n\n12,974 \n⬛⬛⬛🟨⬛ >!PLANK... | 4 | 1662068789 |
| 18 | soleilady | Scoredle 439 4/6* \n\n12,974 \n⬜⬜🟨⬜⬜ >!ARISE... | 3 | 1662072125 |
| 19 | calscigal | Scoredle 439 3/6* \n\n12,974 \n* ⬛⬛⬛⬛⬛ >!CHA... | 4 | 1662075325 |
r_extract = reddit_raw['body'].str.extract(
''.join([
# match header and extract wordle number, score, hard mode
r'(?:Score|Wor)dle.*(?P<wordle>[1-4]\d\d).*(?P<score>[0-6X])/6(?P<hard>\*)?',
# find the first row of the emoji and get the spoiler word in that line
'[^🟥🟧🟨🟩🟦🟪⬛⬜🟫]*\n[^\n]*>!(?P<word>[a-zA-Z]{5}).*!<'])
)
r_extract.head(20)
| wordle | score | hard | word | |
|---|---|---|---|---|
| 0 | 439 | 4 | * | SLIME |
| 1 | 439 | 3 | * | LOSER |
| 2 | 439 | 5 | * | STAIR |
| 3 | 439 | 3 | * | SLATE |
| 4 | NaN | NaN | NaN | NaN |
| 5 | 439 | 4 | * | STEAM |
| 6 | 439 | 4 | * | FLASH |
| 7 | 439 | 6 | NaN | PIOUS |
| 8 | 439 | 3 | * | STONE |
| 9 | NaN | NaN | NaN | NaN |
| 10 | 439 | 3 | * | STARE |
| 11 | NaN | NaN | NaN | NaN |
| 12 | 439 | 3 | * | FLIRT |
| 13 | 439 | 5 | * | STORE |
| 14 | 439 | 4 | * | STEAM |
| 15 | 439 | 4 | * | ALONE |
| 16 | 439 | 3 | * | FIGHT |
| 17 | 439 | 3 | * | PLANK |
| 18 | 439 | 4 | * | ARISE |
| 19 | 439 | 3 | * | CHART |
# remove rows where extraction failed (non-matches)
r_scores = r_extract.dropna(subset='wordle').copy()
# convert dtypes for efficiency
r_scores['wordle'] = r_scores['wordle'].astype('int16')
r_scores['score'] = r_scores['score'].replace(['x','X'],0).astype('int8')
r_scores['word'] = r_scores['word'].str.lower().astype('string')
r_scores['hard'] = r_scores['hard'] == '*'
r_scores.head(20)
| wordle | score | hard | word | |
|---|---|---|---|---|
| 0 | 439 | 4 | True | slime |
| 1 | 439 | 3 | True | loser |
| 2 | 439 | 5 | True | stair |
| 3 | 439 | 3 | True | slate |
| 5 | 439 | 4 | True | steam |
| 6 | 439 | 4 | True | flash |
| 7 | 439 | 6 | False | pious |
| 8 | 439 | 3 | True | stone |
| 10 | 439 | 3 | True | stare |
| 12 | 439 | 3 | True | flirt |
| 13 | 439 | 5 | True | store |
| 14 | 439 | 4 | True | steam |
| 15 | 439 | 4 | True | alone |
| 16 | 439 | 3 | True | fight |
| 17 | 439 | 3 | True | plank |
| 18 | 439 | 4 | True | arise |
| 19 | 439 | 3 | True | chart |
| 20 | 439 | 4 | False | avoid |
| 21 | 439 | 3 | True | raise |
| 22 | 439 | 3 | True | rugby |
The following data relates to Wordle answers and guesses. Most of it requires little cleaning, as it was already scraped into proper formats.
answers = pd.read_csv('data/words/answers.csv', index_col='n')
answers.head(20)
| answers | date | |
|---|---|---|
| n | ||
| 179 | trace | 2021-12-15 |
| 180 | using | 2021-12-16 |
| 181 | peach | 2021-12-17 |
| 182 | champ | 2021-12-18 |
| 183 | baton | 2021-12-19 |
| 184 | brake | 2021-12-20 |
| 185 | pluck | 2021-12-21 |
| 186 | craze | 2021-12-22 |
| 187 | gripe | 2021-12-23 |
| 188 | weary | 2021-12-24 |
| 189 | picky | 2021-12-25 |
| 190 | acute | 2021-12-26 |
| 191 | ferry | 2021-12-27 |
| 192 | aside | 2021-12-28 |
| 193 | tapir | 2021-12-29 |
| 194 | troll | 2021-12-30 |
| 195 | unify | 2021-12-31 |
| 196 | rebus | 2022-01-01 |
| 197 | boost | 2022-01-02 |
| 198 | truss | 2022-01-03 |
freq = pd.read_csv('data/words/freq.csv')
freq.head(20)
| n | answers | freq | zipf | |
|---|---|---|---|---|
| 0 | 179 | trace | 1.550000e-05 | 4.19 |
| 1 | 180 | using | 2.950000e-04 | 5.47 |
| 2 | 181 | peach | 5.890000e-06 | 3.77 |
| 3 | 182 | champ | 7.940000e-06 | 3.90 |
| 4 | 183 | baton | 4.370000e-06 | 3.64 |
| 5 | 184 | brake | 8.910000e-06 | 3.95 |
| 6 | 185 | pluck | 1.350000e-06 | 3.13 |
| 7 | 186 | craze | 2.240000e-06 | 3.35 |
| 8 | 187 | gripe | 6.460000e-07 | 2.81 |
| 9 | 188 | weary | 4.070000e-06 | 3.61 |
| 10 | 189 | picky | 2.190000e-06 | 3.34 |
| 11 | 190 | acute | 1.000000e-05 | 4.00 |
| 12 | 191 | ferry | 1.050000e-05 | 4.02 |
| 13 | 192 | aside | 3.890000e-05 | 4.59 |
| 14 | 193 | tapir | 1.950000e-07 | 2.29 |
| 15 | 194 | troll | 5.370000e-06 | 3.73 |
| 16 | 195 | unify | 1.170000e-06 | 3.07 |
| 17 | 196 | rebus | 3.390000e-07 | 2.53 |
| 18 | 197 | boost | 2.570000e-05 | 4.41 |
| 19 | 198 | truss | 1.950000e-06 | 3.29 |
trends = pd.read_csv('data/trends/trends.csv')
trends.head(20)
| date | Wordle | |
|---|---|---|
| 0 | 2021-12-26 | 0 |
| 1 | 2021-12-27 | 0 |
| 2 | 2021-12-28 | 0 |
| 3 | 2021-12-29 | 0 |
| 4 | 2021-12-30 | 0 |
| 5 | 2021-12-31 | 1 |
| 6 | 2022-01-01 | 1 |
| 7 | 2022-01-02 | 1 |
| 8 | 2022-01-03 | 1 |
| 9 | 2022-01-04 | 4 |
| 10 | 2022-01-05 | 3 |
| 11 | 2022-01-06 | 4 |
| 12 | 2022-01-07 | 6 |
| 13 | 2022-01-08 | 7 |
| 14 | 2022-01-09 | 8 |
| 15 | 2022-01-10 | 8 |
| 16 | 2022-01-11 | 10 |
| 17 | 2022-01-12 | 14 |
| 18 | 2022-01-13 | 14 |
| 19 | 2022-01-14 | 15 |
regional = pd.read_csv('data/trends/regional.csv')
regional = regional.merge(geo_iso.reset_index(), left_on='geoCode', right_on='alpha2')
regional.head(20)
| geoName | geoCode | Wordle | numeric | alpha2 | alpha3 | |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | AF | 0 | 4 | AF | AFG |
| 1 | Albania | AL | 2 | 8 | AL | ALB |
| 2 | Algeria | DZ | 0 | 12 | DZ | DZA |
| 3 | American Samoa | AS | 0 | 16 | AS | ASM |
| 4 | Andorra | AD | 17 | 20 | AD | AND |
| 5 | Angola | AO | 0 | 24 | AO | AGO |
| 6 | Anguilla | AI | 0 | 660 | AI | AIA |
| 7 | Antarctica | AQ | 0 | 10 | AQ | ATA |
| 8 | Antigua & Barbuda | AG | 15 | 28 | AG | ATG |
| 9 | Argentina | AR | 4 | 32 | AR | ARG |
| 10 | Armenia | AM | 1 | 51 | AM | ARM |
| 11 | Aruba | AW | 24 | 533 | AW | ABW |
| 12 | Australia | AU | 55 | 36 | AU | AUS |
| 13 | Austria | AT | 5 | 40 | AT | AUT |
| 14 | Azerbaijan | AZ | 1 | 31 | AZ | AZE |
| 15 | Bahamas | BS | 14 | 44 | BS | BHS |
| 16 | Bahrain | BH | 8 | 48 | BH | BHR |
| 17 | Bangladesh | BD | 2 | 50 | BD | BGD |
| 18 | Barbados | BB | 17 | 52 | BB | BRB |
| 19 | Belarus | BY | 0 | 112 | BY | BLR |
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from wordcloud import WordCloud
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.renderers.default = 'notebook+vscode'
import statsmodels.api as sm
from scipy import stats
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
dtypes = read_csv('data/twitter/dtypes.csv').set_index('')['0'].to_dict()
df_twitter = read_csv('data/twitter/wordle.csv', dtype=dtypes)
df_twitter.info(show_counts=True)
df_twitter.head(20)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 26339556 entries, 0 to 26339555 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 wordle 26339556 non-null int16 1 score 26339556 non-null int8 2 hard 26339556 non-null bool 3 source 26339556 non-null int16 4 username 26339556 non-null object 5 dark 26339556 non-null bool 6 contrast 26339556 non-null bool 7 0 26339556 non-null uint8 8 1 26339556 non-null uint8 9 2 26339556 non-null uint8 10 3 26339556 non-null uint8 11 4 26339556 non-null uint8 12 5 26339556 non-null uint8 13 country 26339556 non-null int16 dtypes: bool(3), int16(3), int8(1), object(1), uint8(6) memory usage: 602.9+ MB
| wordle | score | hard | source | username | dark | contrast | 0 | 1 | 2 | 3 | 4 | 5 | country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 190 | 4 | False | 1 | cafehearts | True | False | 91 | 33 | 166 | 243 | 0 | 0 | 0 |
| 1 | 191 | 4 | False | 1 | cafehearts | True | False | 8 | 1 | 180 | 243 | 0 | 0 | 0 |
| 2 | 192 | 0 | False | 1 | aquaebn | False | False | 2 | 32 | 8 | 32 | 6 | 1 | 0 |
| 3 | 192 | 5 | False | 1 | alejandroechev_ | False | False | 28 | 94 | 220 | 220 | 243 | 0 | 0 |
| 4 | 192 | 4 | False | 2 | W1LL20 | True | False | 37 | 2 | 113 | 243 | 0 | 0 | 0 |
| 5 | 192 | 4 | False | 1 | swansraiders | False | False | 1 | 95 | 113 | 243 | 0 | 0 | 0 |
| 6 | 192 | 3 | False | 0 | tezzamezza28 | True | False | 172 | 221 | 243 | 0 | 0 | 0 | 0 |
| 7 | 192 | 3 | False | 1 | t8tweet | False | False | 32 | 113 | 243 | 0 | 0 | 0 | 0 |
| 8 | 192 | 3 | False | 1 | juicyfruitygirl | False | False | 22 | 47 | 243 | 0 | 0 | 0 | 0 |
| 9 | 192 | 2 | False | 1 | Waterslug3 | False | False | 44 | 243 | 0 | 0 | 0 | 0 | 0 |
| 10 | 192 | 3 | False | 1 | Knightsmad1 | False | False | 109 | 99 | 243 | 0 | 0 | 0 | 0 |
| 11 | 192 | 2 | False | 1 | TheRobboBee | True | False | 211 | 243 | 0 | 0 | 0 | 0 | 0 |
| 12 | 192 | 4 | False | 1 | WemoBemo | False | False | 31 | 12 | 168 | 243 | 0 | 0 | 0 |
| 13 | 192 | 3 | False | 1 | billypilgrym | False | False | 47 | 186 | 243 | 0 | 0 | 0 | 0 |
| 14 | 192 | 3 | False | 1 | BTohiariki | True | False | 166 | 188 | 243 | 0 | 0 | 0 | 0 |
| 15 | 192 | 3 | False | 0 | GrandOldMark | False | False | 94 | 51 | 243 | 0 | 0 | 0 | 0 |
| 16 | 192 | 4 | False | 0 | AWMcLennan | False | False | 43 | 44 | 99 | 243 | 0 | 0 | 0 |
| 17 | 192 | 4 | False | 1 | judew5238 | False | False | 30 | 12 | 168 | 243 | 0 | 0 | 0 |
| 18 | 192 | 3 | False | 0 | electomagneticJ | False | False | 33 | 222 | 243 | 0 | 0 | 0 | 0 |
| 19 | 192 | 4 | False | 3 | nanbutty | False | False | 40 | 94 | 41 | 243 | 0 | 0 | 0 |
# remove tweets where the score does not match the emoji grid
df_twitter = df_twitter[
(df_twitter[[str(n) for n in range(0,6)]] > 0).sum(axis=1) == df_twitter['score'].replace(0,6)
].reset_index(drop=True)
df_twitter.head(20)
| wordle | score | hard | source | username | dark | contrast | 0 | 1 | 2 | 3 | 4 | 5 | country | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 190 | 4 | False | 1 | cafehearts | True | False | 91 | 33 | 166 | 243 | 0 | 0 | 0 |
| 1 | 191 | 4 | False | 1 | cafehearts | True | False | 8 | 1 | 180 | 243 | 0 | 0 | 0 |
| 2 | 192 | 0 | False | 1 | aquaebn | False | False | 2 | 32 | 8 | 32 | 6 | 1 | 0 |
| 3 | 192 | 5 | False | 1 | alejandroechev_ | False | False | 28 | 94 | 220 | 220 | 243 | 0 | 0 |
| 4 | 192 | 4 | False | 2 | W1LL20 | True | False | 37 | 2 | 113 | 243 | 0 | 0 | 0 |
| 5 | 192 | 4 | False | 1 | swansraiders | False | False | 1 | 95 | 113 | 243 | 0 | 0 | 0 |
| 6 | 192 | 3 | False | 0 | tezzamezza28 | True | False | 172 | 221 | 243 | 0 | 0 | 0 | 0 |
| 7 | 192 | 3 | False | 1 | t8tweet | False | False | 32 | 113 | 243 | 0 | 0 | 0 | 0 |
| 8 | 192 | 3 | False | 1 | juicyfruitygirl | False | False | 22 | 47 | 243 | 0 | 0 | 0 | 0 |
| 9 | 192 | 2 | False | 1 | Waterslug3 | False | False | 44 | 243 | 0 | 0 | 0 | 0 | 0 |
| 10 | 192 | 3 | False | 1 | Knightsmad1 | False | False | 109 | 99 | 243 | 0 | 0 | 0 | 0 |
| 11 | 192 | 2 | False | 1 | TheRobboBee | True | False | 211 | 243 | 0 | 0 | 0 | 0 | 0 |
| 12 | 192 | 4 | False | 1 | WemoBemo | False | False | 31 | 12 | 168 | 243 | 0 | 0 | 0 |
| 13 | 192 | 3 | False | 1 | billypilgrym | False | False | 47 | 186 | 243 | 0 | 0 | 0 | 0 |
| 14 | 192 | 3 | False | 1 | BTohiariki | True | False | 166 | 188 | 243 | 0 | 0 | 0 | 0 |
| 15 | 192 | 3 | False | 0 | GrandOldMark | False | False | 94 | 51 | 243 | 0 | 0 | 0 | 0 |
| 16 | 192 | 4 | False | 0 | AWMcLennan | False | False | 43 | 44 | 99 | 243 | 0 | 0 | 0 |
| 17 | 192 | 4 | False | 1 | judew5238 | False | False | 30 | 12 | 168 | 243 | 0 | 0 | 0 |
| 18 | 192 | 3 | False | 0 | electomagneticJ | False | False | 33 | 222 | 243 | 0 | 0 | 0 | 0 |
| 19 | 192 | 4 | False | 3 | nanbutty | False | False | 40 | 94 | 41 | 243 | 0 | 0 | 0 |
# number of tweets from each day
twt_counts = df_twitter['wordle'].value_counts().sort_index()
twt_counts.head(20)
190 4 191 4 192 2367 193 8749 194 9681 195 10545 196 11882 197 17729 198 23976 199 26282 200 49548 201 52132 202 75490 203 90727 204 81599 205 94473 206 135818 207 121337 208 115906 209 144076 Name: wordle, dtype: int64
index = pd.to_datetime(answers.loc[twt_counts.index,'date'])
fig = px.scatter(
x=index,
y=twt_counts,
trendline="lowess", trendline_options=dict(frac=0.06)
)
fig.update_layout(
title_text='Number of game results posted on Twitter for each Wordle',
hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
title='Number of Tweets',
fixedrange=False
)
fig.update_xaxes(
title='Date',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto',
)
fig.show()
fig = px.scatter(
x=index,
y=trends['Wordle'],
trendline="lowess", trendline_options=dict(frac=0.08)
)
fig.update_layout(
title_text='Google Trends search volume of Wordle',
hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
title='Relative search volume',
fixedrange=False
)
fig.update_xaxes(
title='Date',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto',
)
fig.show()
# All users who have posted Wordle results, and the first Wordle they posted
twt_users = df_twitter.drop_duplicates(subset='username')[['wordle','username']]
twt_users.head(20)
| wordle | username | |
|---|---|---|
| 0 | 190 | cafehearts |
| 2 | 192 | aquaebn |
| 3 | 192 | alejandroechev_ |
| 4 | 192 | W1LL20 |
| 5 | 192 | swansraiders |
| 6 | 192 | tezzamezza28 |
| 7 | 192 | t8tweet |
| 8 | 192 | juicyfruitygirl |
| 9 | 192 | Waterslug3 |
| 10 | 192 | Knightsmad1 |
| 11 | 192 | TheRobboBee |
| 12 | 192 | WemoBemo |
| 13 | 192 | billypilgrym |
| 14 | 192 | BTohiariki |
| 15 | 192 | GrandOldMark |
| 16 | 192 | AWMcLennan |
| 17 | 192 | judew5238 |
| 18 | 192 | electomagneticJ |
| 19 | 192 | nanbutty |
| 20 | 192 | 99FascinationSt |
# Total number of users who posted for each day
twt_users_all = df_twitter.groupby('wordle')['username'].nunique().to_frame(name='total')
# New users (based on first wordle)
twt_users_all['new'] = twt_users['wordle'].value_counts().sort_index()
# Users who are not new
twt_users_all['repeat'] = twt_users_all['total'] - twt_users_all['new']
twt_users_all.head(20)
| total | new | repeat | |
|---|---|---|---|
| wordle | |||
| 190 | 4 | 4 | 0 |
| 191 | 4 | 2 | 2 |
| 192 | 2353 | 2349 | 4 |
| 193 | 8663 | 7160 | 1503 |
| 194 | 9601 | 4767 | 4834 |
| 195 | 10457 | 3699 | 6758 |
| 196 | 11771 | 4205 | 7566 |
| 197 | 17562 | 7501 | 10061 |
| 198 | 23784 | 10285 | 13499 |
| 199 | 26094 | 11276 | 14818 |
| 200 | 49093 | 25870 | 23223 |
| 201 | 51671 | 21131 | 30540 |
| 202 | 74632 | 32080 | 42552 |
| 203 | 89638 | 35373 | 54265 |
| 204 | 80846 | 22851 | 57995 |
| 205 | 93668 | 29749 | 63919 |
| 206 | 134375 | 45261 | 89114 |
| 207 | 120150 | 30509 | 89641 |
| 208 | 114920 | 24858 | 90062 |
| 209 | 142772 | 38171 | 104601 |
fig = px.area(
twt_users_all,
x=twt_users_all.index,
y=['new', 'repeat'],
labels={'variable':'Users'},
title='Number of new and repeat users posting Wordle scores over time'
)
fig.update_yaxes(title='Number of users',fixedrange=False)
fig.update_xaxes(
title='Wordle',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()
fig = px.area(
twt_users_all.divide(twt_users_all['total'], axis=0),
x=twt_users_all.index,
y=['new', 'repeat'],
labels={'variable':'Users'},
title='Proportion of new and repeat users posting Wordle scores over time',
range_y=[0,1]
)
fig.update_yaxes(title='Proportion of users',fixedrange=False)
fig.update_xaxes(
title='Wordle',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()
country_users = df_twitter.groupby('country')['username'].nunique().drop(0).rename('count')
country_users = pd.merge(country_users, geo_iso, left_index=True, right_index=True)
country_users['log'] = np.log10(country_users['count'])
country_users.head(20)
| count | alpha2 | alpha3 | log | |
|---|---|---|---|---|
| 4 | 53 | AF | AFG | 1.724276 |
| 8 | 53 | AL | ALB | 1.724276 |
| 10 | 3 | AQ | ATA | 0.477121 |
| 12 | 119 | DZ | DZA | 2.075547 |
| 16 | 3 | AS | ASM | 0.477121 |
| 20 | 28 | AD | AND | 1.447158 |
| 24 | 29 | AO | AGO | 1.462398 |
| 28 | 78 | AG | ATG | 1.892095 |
| 31 | 91 | AZ | AZE | 1.959041 |
| 32 | 3299 | AR | ARG | 3.518382 |
| 36 | 23928 | AU | AUS | 4.378906 |
| 40 | 1395 | AT | AUT | 3.144574 |
| 44 | 52 | BS | BHS | 1.716003 |
| 48 | 334 | BH | BHR | 2.523746 |
| 50 | 548 | BD | BGD | 2.738781 |
| 51 | 71 | AM | ARM | 1.851258 |
| 52 | 287 | BB | BRB | 2.457882 |
| 56 | 1782 | BE | BEL | 3.250908 |
| 60 | 8 | BM | BMU | 0.903090 |
| 64 | 40 | BT | BTN | 1.602060 |
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','log'],
data=country_users,
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Log10 of the number of users'
).add_to(m)
m
geo_internet[geo_internet['users']==0]
| numeric | country | users | |
|---|---|---|---|
| 141 | 408 | North Korea | 0 |
country_relative = country_users.merge(geo_internet[geo_internet['users']>0], left_index=True, right_on='numeric')
country_relative['relative'] = country_relative['count'] / country_relative['users']
country_relative.head(20)
| count | alpha2 | alpha3 | log | numeric | country | users | relative | |
|---|---|---|---|---|---|---|---|---|
| 1 | 53 | AF | AFG | 1.724276 | 4 | Afghanistan | 3672058 | 0.000014 |
| 3 | 53 | AL | ALB | 1.724276 | 8 | Albania | 1942025 | 0.000027 |
| 49 | 119 | DZ | DZA | 2.075547 | 12 | Algeria | 17438483 | 0.000007 |
| 2 | 29 | AO | AGO | 1.462398 | 24 | Angola | 3745750 | 0.000008 |
| 7 | 78 | AG | ATG | 1.892095 | 28 | Antigua and Barbuda | 73703 | 0.001058 |
| 10 | 91 | AZ | AZE | 1.959041 | 31 | Azerbaijan | 7763795 | 0.000012 |
| 5 | 3299 | AR | ARG | 3.518382 | 32 | Argentina | 31118074 | 0.000106 |
| 8 | 23928 | AU | AUS | 4.378906 | 36 | Australia | 21288325 | 0.001124 |
| 9 | 1395 | AT | AUT | 3.144574 | 40 | Austria | 7681572 | 0.000182 |
| 18 | 52 | BS | BHS | 1.716003 | 44 | Bahamas | 312986 | 0.000166 |
| 17 | 334 | BH | BHR | 2.523746 | 48 | Bahrain | 1431062 | 0.000233 |
| 15 | 548 | BD | BGD | 2.738781 | 50 | Bangladesh | 29733670 | 0.000018 |
| 6 | 71 | AM | ARM | 1.851258 | 51 | Armenia | 1882003 | 0.000038 |
| 24 | 287 | BB | BRB | 2.457882 | 52 | Barbados | 226713 | 0.001266 |
| 12 | 1782 | BE | BEL | 3.250908 | 56 | Belgium | 10021205 | 0.000178 |
| 26 | 40 | BT | BTN | 1.602060 | 64 | Bhutan | 333248 | 0.000120 |
| 22 | 51 | BO | BOL | 1.707570 | 68 | Bolivia | 4322219 | 0.000012 |
| 19 | 72 | BA | BIH | 1.857332 | 70 | Bosnia and Herzegovina | 2437067 | 0.000030 |
| 27 | 2004 | BW | BWA | 3.301898 | 72 | Botswana | 885770 | 0.002262 |
| 23 | 8949 | BR | BRA | 3.951775 | 76 | Brazil | 126403573 | 0.000071 |
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','relative'],
data=country_relative[
# remove outliers
country_relative['relative'] < country_relative['relative'].quantile(0.999)
],
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Proportion of internet users in country'
).add_to(m)
m
country_counts = np.log10(df_twitter['country'].value_counts().drop(0).rename('count'))
country_counts = pd.merge(country_counts, geo_iso, left_index=True, right_index=True)
country_counts.head(20)
| count | alpha2 | alpha3 | |
|---|---|---|---|
| 840 | 6.724513 | US | USA |
| 826 | 6.272636 | GB | GBR |
| 124 | 5.897656 | CA | CAN |
| 356 | 5.793000 | IN | IND |
| 36 | 5.648072 | AU | AUS |
| 372 | 5.395740 | IE | IRL |
| 392 | 5.373050 | JP | JPN |
| 608 | 5.189496 | PH | PHL |
| 554 | 5.113823 | NZ | NZL |
| 710 | 5.072467 | ZA | ZAF |
| 276 | 5.037128 | DE | DEU |
| 458 | 5.021557 | MY | MYS |
| 120 | 4.993912 | CM | CMR |
| 360 | 4.979453 | ID | IDN |
| 288 | 4.900236 | GH | GHA |
| 76 | 4.868850 | BR | BRA |
| 388 | 4.864879 | JM | JAM |
| 643 | 4.851754 | RU | RUS |
| 724 | 4.847029 | ES | ESP |
| 706 | 4.782766 | SO | SOM |
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','count'],
data=country_counts,
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Log10 of the number of tweets'
).add_to(m)
m
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','Wordle'],
data=regional,
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Search interest',
).add_to(m)
m
twt_mean = df_twitter[df_twitter['score']>0].groupby('wordle')['score'].mean().rename('mean').to_frame()
twt_mean['answers'] = answers.loc[twt_mean.index,'answers']
twt_mean.head(20)
| mean | answers | |
|---|---|---|
| wordle | ||
| 190 | 4.250000 | acute |
| 191 | 4.250000 | ferry |
| 192 | 3.816205 | aside |
| 193 | 3.530411 | tapir |
| 194 | 3.752526 | troll |
| 195 | 4.327278 | unify |
| 196 | 4.391588 | rebus |
| 197 | 3.989553 | boost |
| 198 | 4.023979 | truss |
| 199 | 4.458242 | siege |
| 200 | 3.921375 | tiger |
| 201 | 4.335495 | banal |
| 202 | 4.078716 | slump |
| 203 | 4.167350 | crank |
| 204 | 4.558212 | gorge |
| 205 | 4.373089 | query |
| 206 | 3.684007 | drink |
| 207 | 4.440339 | favor |
| 208 | 4.525946 | abbey |
| 209 | 4.188266 | tangy |
scores_count = df_twitter.groupby('wordle')['score'].value_counts()
scores_count.name = 'count'
scores_count = scores_count.reset_index()
scores_count['score'] = scores_count['score'].replace(0,'X').astype('string')
scores_count = scores_count.merge(scores_count.groupby('wordle')['count'].sum().rename('total'), left_on='wordle', right_index=True)
scores_count['relative'] = scores_count['count'] / scores_count['total'] * 100
scores_count.head(20)
| wordle | score | count | total | relative | |
|---|---|---|---|---|---|
| 0 | 190 | 4 | 3 | 4 | 75.000000 |
| 1 | 190 | 5 | 1 | 4 | 25.000000 |
| 2 | 191 | 2 | 1 | 4 | 25.000000 |
| 3 | 191 | 4 | 1 | 4 | 25.000000 |
| 4 | 191 | 5 | 1 | 4 | 25.000000 |
| 5 | 191 | 6 | 1 | 4 | 25.000000 |
| 6 | 192 | 4 | 777 | 2367 | 32.826362 |
| 7 | 192 | 3 | 763 | 2367 | 32.234896 |
| 8 | 192 | 5 | 444 | 2367 | 18.757921 |
| 9 | 192 | 2 | 196 | 2367 | 8.280524 |
| 10 | 192 | 6 | 155 | 2367 | 6.548373 |
| 11 | 192 | X | 22 | 2367 | 0.929447 |
| 12 | 192 | 1 | 10 | 2367 | 0.422476 |
| 13 | 193 | 3 | 3587 | 8749 | 40.998971 |
| 14 | 193 | 4 | 2689 | 8749 | 30.734941 |
| 15 | 193 | 2 | 1057 | 8749 | 12.081381 |
| 16 | 193 | 5 | 1003 | 8749 | 11.464167 |
| 17 | 193 | 6 | 348 | 8749 | 3.977597 |
| 18 | 193 | X | 35 | 8749 | 0.400046 |
| 19 | 193 | 1 | 30 | 8749 | 0.342896 |
scores_count_total = scores_count.groupby('score')['count'].sum()
fig = px.bar(
scores_count_total, # Total count of each score
x='count',
text_auto = '.3s',
title='Total number of tweets of each Wordle score',
)
fig.update_traces(
hovertemplate=None,
marker_color=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
marker_line=dict(width=1.2, color='black')
)
fig.update_layout(hovermode='y')
fig.show()
fig = px.pie(
names=scores_count_total.index, values=scores_count_total, hole=0.5,
title='Proportion of tweets with score'
)
fig.update_traces(
marker_colors=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
marker_line=dict(width=1.2, color='black')
)
fig.show()
fig = px.area(scores_count, x='wordle', y='count', color='score',
category_orders={'score':['1','2','3','4','5','6','X']},
color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
title='Number of tweets by score over time'
)
fig.update_yaxes(title='Number of Tweets',fixedrange=False)
fig.update_xaxes(title='Wordle',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto')
fig.update_traces(hovertemplate=None)
fig.update_layout(hovermode='x unified')
fig.show()
fig = px.area(scores_count, x='wordle', y='relative', color='score',
category_orders={'score':['1','2','3','4','5','6','X']},
color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
range_y=[0,100],
title='Proportion of tweets with score over time'
)
fig.update_yaxes(title='Proportion of Tweets',fixedrange=False)
fig.update_xaxes(title='Wordle',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto')
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()
scores_count_sorted = scores_count.merge(twt_mean, on='wordle').sort_values(by=['mean','wordle'])
scores_count_sorted.head(20)
| wordle | score | count | total | relative | mean | answers | |
|---|---|---|---|---|---|---|---|
| 895 | 319 | 3 | 32253 | 97930 | 32.934749 | 3.101278 | train |
| 896 | 319 | 2 | 26394 | 97930 | 26.951904 | 3.101278 | train |
| 897 | 319 | 4 | 21382 | 97930 | 21.833963 | 3.101278 | train |
| 898 | 319 | 5 | 8770 | 97930 | 8.955376 | 3.101278 | train |
| 899 | 319 | 1 | 5822 | 97930 | 5.945063 | 3.101278 | train |
| 900 | 319 | 6 | 3001 | 97930 | 3.064434 | 3.101278 | train |
| 901 | 319 | X | 308 | 97930 | 0.314510 | 3.101278 | train |
| 811 | 307 | 3 | 43884 | 111993 | 39.184592 | 3.312547 | plant |
| 812 | 307 | 4 | 30758 | 111993 | 27.464217 | 3.312547 | plant |
| 813 | 307 | 2 | 21736 | 111993 | 19.408356 | 3.312547 | plant |
| 814 | 307 | 5 | 10508 | 111993 | 9.382729 | 3.312547 | plant |
| 815 | 307 | 6 | 2904 | 111993 | 2.593019 | 3.312547 | plant |
| 816 | 307 | 1 | 1918 | 111993 | 1.712607 | 3.312547 | plant |
| 817 | 307 | X | 285 | 111993 | 0.254480 | 3.312547 | plant |
| 720 | 294 | 3 | 45538 | 126430 | 36.018350 | 3.341545 | stair |
| 721 | 294 | 4 | 33418 | 126430 | 26.432018 | 3.341545 | stair |
| 722 | 294 | 2 | 26760 | 126430 | 21.165863 | 3.341545 | stair |
| 723 | 294 | 5 | 13236 | 126430 | 10.469034 | 3.341545 | stair |
| 724 | 294 | 6 | 4536 | 126430 | 3.587756 | 3.341545 | stair |
| 725 | 294 | 1 | 1948 | 126430 | 1.540774 | 3.341545 | stair |
fig = px.area(scores_count_sorted, x='answers', y='relative', color='score',
category_orders={'score':['1','2','3','4','5','6','X']},
color_discrete_map={'1':'#0a5e03','2':'#50ba47','3':'#b1f073','4':'#f7f75c','5':'#e8ac15','6':'#cf4911','X':'#851101'},
range_y=[0,100],
title='Proportion of tweets with score sorted by mean score'
)
fig.update_yaxes(title='Proportion of Tweets',fixedrange=False)
fig.update_xaxes(title='Word',rangeslider_visible=True,rangeslider_yaxis_rangemode='auto')
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(
x=index,
y=twt_mean['mean'],
mode='lines+markers',
name='Mean score',
text=twt_mean['answers'],
hovertemplate='<b>%{text}</b><br>Mean score: %{y}<br><extra></extra>'
))
fig.add_trace(go.Scatter(
x=index,
y=np.array(
sm.nonparametric.lowess(twt_mean['mean'], twt_mean.index, frac=8/100)
)[:,1],
mode='lines',
name='LOWESS fit',
line=dict(width=3),hovertemplate=None
))
mean = df_twitter['score'].mean()
fig.update_layout(
title_text='Average (mean) score of Wordle games',
hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
title='Mean score',
fixedrange=False,
range=[3,5]
)
fig.update_xaxes(
title='Date',
range=[index.iloc[0], index.iloc[-1]],
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto',
rangeslider_range=[index.iloc[0], index.iloc[-1]]
)
fig.show()
px.box(x=twt_mean['mean'], title='Distribution of the average score of each day')
answer_trends = pd.read_csv('data/words/trends.csv', index_col=0)
answer_trends = answer_trends.apply(lambda col: pd.DataFrame(sm.nonparametric.lowess(col, col.index, frac=14/100)).set_index(0).squeeze(), axis=0)
answer_spikes = answer_trends.iloc[50:100].mean() / answer_trends.iloc[np.r_[0:50, 100:len(answer_trends)]].mean()
answer_spikes.name = 'spikes'
answer_spikes.sort_values().tail(20)
slosh 2.905279 aphid 2.905921 ruder 2.909033 coyly 2.909570 gloat 2.917525 midge 2.918356 glean 2.922953 canny 2.925490 lowly 2.927586 piety 2.937736 crept 2.949311 droll 2.949753 wrung 2.953050 cynic 2.954943 gawky 2.966924 swill 2.975927 tacit 2.976672 foray 2.980783 parer 2.997150 trice 2.997755 Name: spikes, dtype: float64
'spikes' refers to the spike in Google Trends search interest for a word around the time it is the answer. The more obscure the answer, the more people are likely to search for the word online. However, this is a somewhat arbitrary measure.
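As an illustration, the spike ratio can be computed on a synthetic search-interest series (all values here are made up; the real computation above uses the LOWESS-smoothed Google Trends columns):

```python
import numpy as np

# Synthetic search-interest series: flat baseline, with interest tripling
# in a 50-row window around the (hypothetical) answer date.
trend = np.ones(200)
trend[50:100] = 3.0

# Spike ratio = mean interest inside the window / mean interest outside it.
inside = trend[50:100].mean()
outside = np.r_[trend[:50], trend[100:]].mean()
spike = inside / outside  # 3.0 for this toy series
```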
mean_spikes = twt_mean.merge(answer_spikes, left_on='answers', right_index=True)
px.scatter(
mean_spikes,
y=mean_spikes['spikes'], x='mean',
marginal_x='histogram',
marginal_y='histogram'
)
stats.pearsonr(mean_spikes['mean'], mean_spikes['spikes'])
(0.30508543269175975, 8.770829318927126e-07)
ans_freq = freq.merge(twt_mean)
ans_freq.head(20)
| | n | answers | freq | zipf | mean |
|---|---|---|---|---|---|
| 0 | 190 | acute | 1.000000e-05 | 4.00 | 4.250000 |
| 1 | 191 | ferry | 1.050000e-05 | 4.02 | 4.250000 |
| 2 | 192 | aside | 3.890000e-05 | 4.59 | 3.816205 |
| 3 | 193 | tapir | 1.950000e-07 | 2.29 | 3.530411 |
| 4 | 194 | troll | 5.370000e-06 | 3.73 | 3.752526 |
| 5 | 195 | unify | 1.170000e-06 | 3.07 | 4.327278 |
| 6 | 196 | rebus | 3.390000e-07 | 2.53 | 4.391588 |
| 7 | 197 | boost | 2.570000e-05 | 4.41 | 3.989553 |
| 8 | 198 | truss | 1.950000e-06 | 3.29 | 4.023979 |
| 9 | 199 | siege | 8.510000e-06 | 3.93 | 4.458242 |
| 10 | 200 | tiger | 2.000000e-05 | 4.30 | 3.921375 |
| 11 | 201 | banal | 7.590000e-07 | 2.88 | 4.335495 |
| 12 | 202 | slump | 2.290000e-06 | 3.36 | 4.078716 |
| 13 | 203 | crank | 4.070000e-06 | 3.61 | 4.167350 |
| 14 | 204 | gorge | 2.510000e-06 | 3.40 | 4.558212 |
| 15 | 205 | query | 4.370000e-06 | 3.64 | 4.373089 |
| 16 | 206 | drink | 7.940000e-05 | 4.90 | 3.684007 |
| 17 | 207 | favor | 3.800000e-05 | 4.58 | 4.440339 |
| 18 | 208 | abbey | 7.590000e-06 | 3.88 | 4.525946 |
| 19 | 209 | tangy | 4.370000e-07 | 2.64 | 4.188266 |
px.scatter(
ans_freq,
y='zipf', x='mean',
marginal_x='histogram',
marginal_y='histogram'
)
stats.pearsonr(ans_freq['mean'], ans_freq['zipf'])
(-0.35770111812269373, 5.8445702885770625e-09)
answers['answers'].apply(set).str.len()
n
179 5
180 5
181 5
182 5
183 5
..
454 4
455 5
456 5
457 5
458 5
Name: answers, Length: 280, dtype: int64
twt_mean.merge(
answers['answers'].apply(set).str.len().rename('unique'),
how='inner', left_index=True, right_index=True)
| | mean | answers | unique |
|---|---|---|---|
| 190 | 4.250000 | acute | 5 |
| 191 | 4.250000 | ferry | 4 |
| 192 | 3.816205 | aside | 5 |
| 193 | 3.530411 | tapir | 5 |
| 194 | 3.752526 | troll | 4 |
| ... | ... | ... | ... |
| 435 | 4.686065 | gauze | 5 |
| 436 | 3.816523 | chief | 5 |
| 437 | 3.959463 | onset | 5 |
| 438 | 4.652926 | prize | 5 |
| 439 | 4.255168 | fungi | 5 |
250 rows × 3 columns
ans_unique = twt_mean.merge(answers['answers'].apply(set).str.len().rename('unique'), how='inner', left_index=True, right_index=True)
fig = px.scatter(
ans_unique,
y='unique', x='mean',custom_data=['answers'],
marginal_x='histogram',
marginal_y='histogram'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>Unique letter: %{y}',selector=dict(type='scatter'))
stats.pearsonr(ans_unique['mean'], ans_unique['unique'])
(-0.4254767213877685, 2.056215220036592e-12)
def match(guess, answer):
result = [0] * 5
guess_matched = [False] * 5
answer_matched = [False] * 5
for i in range(len(guess)):
if guess[i] == answer[i] and not answer_matched[i]:
result[i] = 2
guess_matched[i] = True
answer_matched[i] = True
for i in range(len(guess)):
guess_letter = guess[i]
if not guess_matched[i]:
for j in range(len(answer)):
if not answer_matched[j] and guess_letter == answer[j]:
result[i] = 1
answer_matched[j] = True
break
return result
def base3(n):
digits = []
while n:
digits.append(n % 3)
n //= 3
return ([0,0,0,0,0] + (digits[::-1] or [0]))[-5:]
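As a quick sanity check, the base-3 row encoding can be round-tripped (the `encode` helper below is illustrative and simply inverts `base3`):

```python
# Each tile of a guess row is a base-3 digit: 0 = grey, 1 = yellow,
# 2 = green, most significant tile first.
def base3(n):
    digits = []
    while n:
        digits.append(n % 3)
        n //= 3
    return ([0, 0, 0, 0, 0] + (digits[::-1] or [0]))[-5:]

def encode(row):
    """Inverse of base3: pack five tile digits into one integer."""
    n = 0
    for d in row:
        n = n * 3 + d
    return n

row = [2, 1, 0, 0, 1]  # green, yellow, grey, grey, yellow
n = encode(row)        # 2*81 + 1*27 + 0 + 0 + 1 = 190
```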
# Number of green tiles in the encoded first row of each game
patterns = pd.Series(range(0, 243), index=range(1, 244))
greens_per_pattern = (patterns.apply(lambda n: pd.Series(base3(n))) == 2).sum(axis=1)
greens = pd.concat([
    df_twitter['score'],
    df_twitter['0'].map(greens_per_pattern).astype('int8').rename('greens')
], axis=1)
px.scatter(
greens.value_counts().rename('count').reset_index(),
x='score',y='greens',size='count',
title='Comparison of the number of greens in the first row vs score, and the count of the combination'
)
stats.pearsonr(greens['score'], greens['greens'])
(-0.2692805753434382, 0.0)
# Number of grey tiles in the encoded first row of each game
patterns = pd.Series(range(0, 243), index=range(1, 244))
greys_per_pattern = (patterns.apply(lambda n: pd.Series(base3(n))) == 0).sum(axis=1)
greys = pd.concat([
    df_twitter['score'],
    df_twitter['0'].map(greys_per_pattern).astype('int8').rename('greys')
], axis=1)
stats.pearsonr(greys['score'], greys['greys'])
(0.26993847548792826, 0.0)
wordcloud = WordCloud(width = 1600, height = 900,
background_color ='white',
min_font_size = 10).generate_from_frequencies(r_scores['word'].value_counts())
fig = px.imshow(wordcloud)
fig.update_layout(margin=dict(l=10,r=10,t=10,b=10),hovermode=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
word_scores = pd.merge(
r_scores.loc[r_scores['score']>0].groupby('word')['score'].mean(),
r_scores['word'].value_counts().rename('count'),
left_index=True, right_index=True
)
word_scores.head(20)
| | score | count |
|---|---|---|
| abase | 5.000000 | 1 |
| abate | 3.000000 | 1 |
| abhor | 4.333333 | 3 |
| abide | 4.285714 | 7 |
| abled | 3.000000 | 1 |
| abode | 4.500000 | 2 |
| abort | 3.555556 | 9 |
| about | 3.952381 | 130 |
| above | 3.833333 | 6 |
| abuse | 3.666667 | 6 |
| abysm | 4.000000 | 1 |
| acers | 6.000000 | 1 |
| ached | 2.000000 | 1 |
| aches | 4.500000 | 2 |
| acids | 4.000000 | 1 |
| acorn | 3.500000 | 20 |
| acres | 3.600000 | 5 |
| acrid | 4.000000 | 7 |
| actor | 3.714286 | 7 |
| acute | 3.333333 | 15 |
fig = px.scatter(
word_scores.sort_values(by='count').tail(50).reset_index(),
x='score',
y='count',
text='index',
title='Number of uses vs mean score of the 50 most popular starting words'
)
fig.update_traces(textposition='top center')
fig.show()
fig = px.scatter(
word_scores[word_scores['count']>5].sort_values(by='score').head(50).reset_index(),
x='score',
y='count',
text='index',
title='Number of uses vs mean score of the 50 best starting words (by mean)'
)
fig.update_traces(textposition='top center')
fig.show()
Google Trends can be used to analyse the popularity of things: it gauges the popularity of a keyword online by measuring its web search volume. The search volume of 'Wordle' over time therefore serves as a useful estimate of how popular the game is.
df_pop = pd.DataFrame({
'tweets':sm.nonparametric.lowess(twt_counts, twt_counts.index, frac=0.06)[:,1],
'search':sm.nonparametric.lowess(trends['Wordle'], trends.index, frac=0.1)[:,1]
})
fig = px.line(
x=index,
y=df_pop['search'] / df_pop['search'].max() * 100,
title='LOWESS fit of the relative search volume for Wordle over time'
)
fig.update_layout(
hovermode='x unified'
)
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_yaxes(
title='Relative volume (%)',
fixedrange=False
)
fig.update_xaxes(
title='Date',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto'
)
fig.show()
The Locally Weighted Scatterplot Smoothing (LOWESS) algorithm was applied to the raw data to reduce noise and more clearly identify trends.
The search volume for Wordle increased from December 2021, representing its surge in popularity after initially being released to the public in October. The search volume reached a peak in February 2022, with the highest search volume recorded on 17 February 2022. Since then, the search volume has been generally on the decline, decreasing to 40% of the maximum search volume in September 2022.
In mid-December 2021, Wordle introduced the sharing feature, which allowed users to share their game result without spoiling the game for others. This is achieved by sharing a grid of coloured emoji squares, with each row representing a guess and the colour of each square indicating whether the letter is absent, present, or correct. This could be easily shared on social media platforms like Twitter, where the cryptic emoji grids quickly went viral.
Again with the rhyming game. I lucked out this time!#Wordle420 4/6*
— Gayla (@AZGayla) August 13, 2022
⬛⬛⬛🟨⬛
⬛⬛🟩⬛🟩
⬛🟩🟩🟩🟩
🟩🟩🟩🟩🟩 pic.twitter.com/cySnOsjYSg
Wordle 420 X/6
— 🎃Hogueus Woahcus 🎃 (@HogueLikeWoah) August 14, 2022
⬛🟧⬛⬛⬛
⬛🟧⬛⬛⬛
⬛🟧⬛🟦🟧
🟧🟧⬛⬛🟧
🟧🟧⬛⬛🟧
🟧🟧⬛🟧🟧
this was so rude
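Shared grids like the ones above can be decoded programmatically. A minimal sketch (the tile characters cover the standard and high-contrast/colour-blind themes; `parse_grid` is an illustrative helper, not the project's actual parser):

```python
# Map each emoji tile to a tile state: 0 = absent, 1 = present, 2 = correct.
TILES = {'⬛': 0, '⬜': 0, '🟨': 1, '🟦': 1, '🟩': 2, '🟧': 2}

def parse_grid(text):
    """Return one list of five tile states per guess row found in `text`."""
    rows = []
    for line in text.splitlines():
        tiles = [TILES[ch] for ch in line if ch in TILES]
        if len(tiles) == 5:
            rows.append(tiles)
    return rows

grid = "Wordle 420 4/6\n\n⬛⬛⬛🟨⬛\n⬛⬛🟩⬛🟩\n⬛🟩🟩🟩🟩\n🟩🟩🟩🟩🟩"
rows = parse_grid(grid)
# A solved game ends in an all-green row; otherwise the score is 'X'.
score = len(rows) if rows[-1] == [2] * 5 else 'X'
```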
fig = px.scatter(
x=index,
y=twt_counts,
trendline="lowess", trendline_options=dict(frac=0.06),
title='The number of Wordle scores shared on Twitter over time'
)
fig.update_layout(
hovermode='x unified'
)
#fig.update_traces(hovertemplate=None)
fig.update_yaxes(
title='Number of tweets',
fixedrange=False
)
fig.update_xaxes(
title='Date',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(marker_size=5, hovertemplate=None)
fig.update_traces(hovertemplate='%{y} <b>(LOWESS trend)</b>', selector = dict(type='scatter', mode='lines'))
fig.show()
The number of Wordle scores shared on Twitter follows a similar trend: an increase from December 2021 to a peak in February 2022, followed by a decline. However, the peak is slightly earlier, with a maximum of 321.7k tweets on 4 February 2022. Compared to the search volume, tweet volume also fell more sharply, dropping to around 11% of the maximum by September 2022. This suggests that tweet volume is not just a measure of Wordle's popularity but also a driver of it: as more people share their scores on Twitter, more people are exposed to the game and become more likely to search for it online, or even play it. The steeper decline in tweet volume relative to search volume suggests that sharing Wordle scores has fallen out of fashion, without Wordle itself necessarily declining as much; the game is no longer nearly as viral on Twitter as it was at its peak.
For further insight, we can look at the number of (unique) users sharing their game everyday, and compare the proportion of new and repeat users.
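As a sketch of how this split can be derived from (date, username) pairs, a user counts as "new" on the first day they appear (the data below is a toy stand-in for the scraped tweets):

```python
import pandas as pd

# Toy posting history: user 'a' posts three days running, 'b' and 'c' once.
posts = pd.DataFrame({
    'date': ['2022-01-01', '2022-01-01', '2022-01-02', '2022-01-02', '2022-01-03'],
    'username': ['a', 'b', 'a', 'c', 'a'],
})

# A user's first posting date marks them as 'new' on that day, 'repeat' after.
first_seen = posts.groupby('username')['date'].transform('min')
posts['kind'] = (posts['date'] == first_seen).map({True: 'new', False: 'repeat'})

# Daily counts of new vs repeat posters.
daily = posts.groupby(['date', 'kind']).size().unstack(fill_value=0)
```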
fig = px.area(
twt_users_all.divide(twt_users_all['total'], axis=0) * 100,
x=index,
y=['new', 'repeat'],
labels={'variable':'Users'},
title='Proportion of new and repeat users posting Wordle scores over time'
)
fig.update_yaxes(title='Proportion of users (%)',fixedrange=False)
fig.update_xaxes(
title='Date',
rangeslider_visible=True,
rangeslider_yaxis_rangemode='auto'
)
fig.update_traces(hovertemplate='%{y:.2f}%')
fig.update_layout(hovermode='x unified')
fig.show()
We can see that the proportion of new users posting Wordle scores on Twitter has generally been declining since the game's release. The initial increase in users and tweets is explained by the relatively high proportion of new users near the start. As the proportion of new users declines, the growth slows and eventually becomes negligible. After February 2022, the peak in Wordle's popularity, the proportion of new users drops below 5% on most days, and to around 1% after May 2022. This indicates that there is much less interest or hype surrounding the game, with fewer people wanting to start playing. It also shows that the decline in plays is primarily due to existing players losing interest and no longer playing, as the effect of new users is inconsequential. However, it could also simply mean that sharing Wordle scores is no longer as popular, though this would in turn reduce people's exposure to the game as they see fewer Wordle scores.
Besides interest over time, Google Trends also has data on the search interest by region, which is useful for seeing where a search term is most popular. The search interest by region is a measure relative to the total number of Google searches in the region at a specific time.
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','Wordle'],
data=regional,
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Search interest',
).add_to(m)
m
For clarity, I have also displayed the top values in a table:
regional[['geoName','Wordle']].rename(
columns={'geoName':'country','Wordle':'search interest'}
).sort_values(by='search interest',ascending=False).head(10)
| country | search interest | |
|---|---|---|
| 92 | Guernsey | 100 |
| 155 | New Zealand | 97 |
| 106 | Ireland | 96 |
| 107 | Isle of Man | 81 |
| 112 | Jersey | 73 |
| 235 | United Kingdom | 69 |
| 38 | Canada | 61 |
| 85 | Gibraltar | 56 |
| 12 | Australia | 55 |
| 236 | United States | 53 |
This indicates that Wordle is most popular in Guernsey, New Zealand, and Ireland.
We can also look at the tweets of Wordle games to compare by country. Twitter has an optional location field where users can specify their location, which can be used to classify users by country.
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','log'],
data=country_users,
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Log10 of the number of users'
).add_to(m)
m
Looking at the number of users, the USA has by far the most people sharing Wordle games on Twitter. However, there are some limitations to this analysis. The location data is derived from the user's location field, which is simply a free-text field where users can input anything. As such, the locations may not be entirely accurate or complete, and may not be representative of the actual distribution across countries (i.e. users in some countries may be more likely to specify a location on Twitter).
We can then take the number of users as a proportion of the number of internet users in the country to get an estimate of the proportion of people in a country who post Wordle games on Twitter.
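A minimal sketch of this normalisation, with made-up numbers and illustrative column names (`users`, `internet_users`):

```python
import pandas as pd

# Hypothetical counts of Wordle posters and internet users per country.
country_users = pd.DataFrame({'alpha3': ['USA', 'NZL'], 'users': [500_000, 40_000]})
internet = pd.DataFrame({'alpha3': ['USA', 'NZL'], 'internet_users': [300e6, 4.5e6]})

# Posters as a proportion of each country's internet users.
country_relative = country_users.merge(internet, on='alpha3')
country_relative['relative'] = country_relative['users'] / country_relative['internet_users']
```

Despite having far fewer posters in absolute terms, the smaller country can end up with the higher relative value, which is the effect the map above is meant to surface.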
m = folium.Map(tiles='CartoDB positron')
folium.Choropleth(
geo_data='data/geo/countries.geojson',
name="choropleth",
columns=['alpha3','relative'],
data=country_relative[
# remove outliers
country_relative['relative'] < country_relative['relative'].quantile(0.999)
],
key_on="feature.properties.ISO_A3",
fill_color="YlGnBu",
legend_name='Proportion of internet users in country'
).add_to(m)
m
This yields a similar result to the Google Trends map, with countries like New Zealand and Ireland showing higher proportions. Other countries like the USA, UK, and Canada also have relatively high proportions of people posting Wordle games on Twitter. This is not necessarily a direct measure of how popular Wordle is in each country, as the popularity of Twitter varies among countries, and people in some countries may be more likely to post scores on Twitter. However, it still provides a decent estimate of Wordle's popularity by country.
4 is a Wordle par. 3 a birdie. 2 an eagle. 5 a bogey. 6 a double bogey. After 22 days I'm at -5.
— Tim Urban (@waitbutwhy) January 26, 2022
scores_count_total = scores_count.groupby('score')['count'].sum()
fig = px.bar(
scores_count_total, # Total count of each score
x='count',
text_auto = '.3s',
title='Total number of tweets of each Wordle score'
)
fig.update_traces(
hovertemplate=None,
marker_color=['#0a5e03','#50ba47','#b1f073','#f7f75c','#e8ac15','#cf4911','#851101'],
marker_line=dict(width=1.2, color='black')
)
fig.update_layout(hovermode='y')
fig.show()
Having a score of 4 is by far the most common, with 8.38M tweets. This matches the general sentiment that a score of 4 is normal, average, or unspectacular, with lower scores considered good and higher scores considered poor. A score of 3 is the next most common at 6.20M tweets, with scores of 5 close behind at just below 6M. The majority of tweets fall within this range, with the remaining scores often regarded as simply poor or great luck. A score of 1 is by far the rarest, with only 188k tweets. In general, 4 is the most common score, with scores above 4 more common than those below, which makes intuitive sense as lower scores take more skill and luck.
df_twitter[df_twitter['score']>0]['score'].describe().to_frame()
| score | |
|---|---|
| count | 2.560955e+07 |
| mean | 4.081433e+00 |
| std | 1.137395e+00 |
| min | 1.000000e+00 |
| 25% | 3.000000e+00 |
| 50% | 4.000000e+00 |
| 75% | 5.000000e+00 |
| max | 6.000000e+00 |
The median and mode score are both 4, and the mean score across all tweets is 4.08. Hence, we can conclude that the average Wordle game has a score of 4. The IQR of 3 to 5 also shows that the majority of games yield a score in this range.
fig = px.scatter(
x=index,
y=twt_mean['mean'],
#mode='lines+markers',
#name='Mean score',
custom_data=[twt_mean['answers']],
marginal_y='box',
labels=dict(x="Date", y="Mean score")
)
fig.update_traces(
hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{y}<br><extra></extra>',
selector=dict(type='scatter')
)
fig.update_layout(
title_text='Average (mean) score of Wordle games over time',
#hovermode='x unified',
#selector=dict(type='scatter')
)
fig.show()
As shown, the average score of each Wordle game generally lies around 4. More specifically, the median of the daily average scores is 4.12.
fig = px.scatter(
ans_freq,
y='zipf', x='mean', custom_data=['answers'],
marginal_x='histogram',
marginal_y='histogram',
title='zipf frequency vs average score of word'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>zipf: %{y}',selector=dict(type='scatter'))
stats.pearsonr(ans_freq['mean'], ans_freq['zipf'])
(-0.35770111812269373, 5.8445702885770625e-09)
ans_unique = twt_mean.merge(answers['answers'].apply(set).str.len().rename('unique'), how='inner', left_index=True, right_index=True)
fig = px.scatter(
ans_unique,
y='unique', x='mean',custom_data=['answers'],
marginal_x='histogram',
marginal_y='histogram',
title='Number of unique letters vs average score of word',
trendline='ols'
)
fig.update_traces(hovertemplate='<b>%{customdata[0]}</b><br>Mean score: %{x}<br>Unique letter: %{y}',selector=dict(type='scatter'))
stats.pearsonr(ans_unique['mean'], ans_unique['unique'])
(-0.4254767213877685, 2.056215220036592e-12)
Both the word frequency and the number of unique letters have an effect on the average score of a Wordle. Higher word frequency implies that the word is more likely to be known, which decreases the average number of guesses needed. Repeated letters may be less expected by players and harder to guess, which can result in a higher number of guesses. However, while there is some correlation between these factors and the average difficulty, the correlation is not strong.
ans_factors = ans_freq.merge(ans_unique)
mlm = LinearRegression()
x_train, x_test, y_train, y_test = train_test_split(ans_factors[['zipf','unique']], ans_factors['mean'], test_size=0.3, random_state=0)
mlm.fit(x_train, y_train)
y_hat = mlm.predict(x_test)
sns.kdeplot(y_test, color='r', label='Actual Value')
sns.kdeplot(y_hat, color='b', label='Fitted Value')
plt.legend()
plt.show()
Evidently, it is not very accurate to use these two factors alone to predict the score. While the modal value is similar, the density is quite far off.
Given human nature, this result is not unexpected. Logically there is some correlation, but it is not strong. Different people attempting the same Wordle have different skill levels, different approaches to solving, and so on; as such, the average score is not necessarily a good measure of difficulty. Taking this into consideration, I believe it is fair to consider the (mild) correlation between these factors and the average score a decent indicator of difficulty.
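To put a number on the fit quality, the held-out R² can be computed. The sketch below does this on synthetic stand-ins for the zipf/unique-letter features (the actual values would come from `ans_factors` above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: two weak predictors plus noise, mimicking the
# weak zipf/unique correlations observed above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 4.1 - 0.1 * X[:, 0] - 0.2 * X[:, 1] + rng.normal(scale=0.5, size=300)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(x_train, y_train)
r2 = r2_score(y_test, model.predict(x_test))  # well below 1: most variance unexplained
```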
To get an idea of the most common starting word, we will look at the starting words used in the Reddit comments.
wordcloud = WordCloud(width = 1600, height = 900,
background_color ='white',
min_font_size = 10).generate_from_frequencies(r_scores['word'].value_counts())
fig = px.imshow(wordcloud, title='Word cloud of the most popular starting words on Reddit')
fig.update_layout(margin=dict(l=10,r=10,t=40,b=10),hovermode=False)
fig.update_xaxes(showticklabels=False)
fig.update_yaxes(showticklabels=False)
fig.show()
It can be seen that the most used starting word on Reddit is 'stare', followed by 'crane'. Many of these words contain multiple vowels, with 'a' and 'e' being especially common.
To more clearly show this, I have displayed the top 20 most popular starting words below with their count.
r_scores['word'].value_counts().rename('count').to_frame().head(20)
| count | |
|---|---|
| stare | 3085 |
| crane | 1616 |
| raise | 1387 |
| adieu | 1143 |
| crate | 966 |
| audio | 897 |
| slate | 834 |
| trace | 731 |
| arise | 709 |
| salet | 632 |
| roate | 576 |
| tares | 506 |
| house | 490 |
| stern | 472 |
| train | 430 |
| tears | 425 |
| crwth | 401 |
| soare | 395 |
| irate | 392 |
| stear | 387 |
With more time and further analysis, I'm sure a more in-depth look into Wordle can be achieved. For instance, more variables, such as the amount of interaction on social media posts, could be considered to provide more insight into trends and patterns. More detailed analysis, modelling, and simulations could also be used to assess the difficulty of words and the effects of different starting words and strategies. However, without data from Wordle itself, there is a limit to how accurate such analysis can be.